Machine Learning Performance Engineer

May 21

🏢 In-office - Manhattan


Jane Street

Jane Street works differently.

1001 - 5000

Description

• We are looking for an engineer with experience in low-level systems programming and optimisation to join our growing ML team.
• Machine learning is a critical pillar of Jane Street's global business. Our ever-evolving trading environment serves as a unique, rapid-feedback platform for ML experimentation, allowing us to incorporate new ideas with relatively little friction.
• Your part here is optimising the performance of our models - both training and inference. We care about efficient large-scale training, low-latency inference in real-time systems and high-throughput inference in research. Part of this is improving straightforward CUDA, but the interesting part needs a whole-systems approach, including storage systems, networking and host- and GPU-level considerations. Zooming in, we also want to ensure our platform makes sense even at the lowest level - is all that throughput actually goodput? Does loading that vector from the L2 cache really take that long?
• If you've never thought about a career in finance, you're in good company. Many of us were in the same position before working here. If you have a curious mind and a passion for solving interesting problems, we have a feeling you'll fit right in.
• There's no fixed set of skills, but here are some of the things we're looking for:

Requirements

• An understanding of modern ML techniques and toolsets
• The experience and systems knowledge required to debug a training run’s performance end to end
• Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores and the memory hierarchy
• Debugging and optimisation experience using tools like CUDA GDB, Nsight Systems and Nsight Compute
• Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS
• Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads
• Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters
• An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
• An inventive approach and the willingness to ask hard questions about whether we’re taking the right approaches and using the right tools
• Fluency in English

