We are seeking a highly skilled ML/AI Engineer to lead and support the benchmarking of GPU platforms for machine learning and AI workloads. You will play a critical role in evaluating how GPU-based hardware performs across deep learning and AI frameworks, enabling data-driven decisions for platform optimization and next-generation hardware development.
## Responsibilities
- Work closely with hardware and development teams to profile and analyze GPU performance at the system and kernel level.
- Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
- Debug and optimize ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
- Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
- Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimizations on performance and scalability.
- Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
- Contribute to internal tooling, frameworks, and best practices.
## Requirements
- Strong understanding of the theoretical foundations of machine learning.
- Deep understanding of performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimizations, dynamic batching, etc.).
- Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM).
- Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries.
- Familiarity with containerized environments (e.g., Docker, Kubernetes).
- Strong communication skills and ability to work independently.
## Preferred Qualifications
- Familiarity with modern LLM inference frameworks (vLLM, SGLang, TensorRT-LLM).
- Experience with Python and performance profiling tools (e.g., Nsight, nvprof, perf).
- Familiarity with cloud platforms and their ML services (e.g., AWS, GCP, Azure ML).
- Contributions to open-source ML benchmarking tools.