## ML/AI Engineer - GPU Platform Benchmarking
Nebius is seeking a highly skilled ML/AI Engineer to lead and support benchmarking of GPU platforms for machine learning and AI workloads. You will play a critical role in evaluating the performance of GPU-based hardware for various deep learning and AI frameworks, enabling data-driven decisions for platform optimisation and next-generation hardware development.
### Responsibilities:
- Work closely with hardware and development teams to profile and analyse GPU performance at the system and kernel level
- Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm)
- Debug and optimise ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks
- Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads
- Run experiments across diverse GPU system configurations to assess how different interconnect strategies and system-level optimisations affect performance and scalability
- Develop tools and dashboards to visualise performance metrics, bottlenecks, and trends
- Contribute to internal tooling, frameworks, and best practices
### Required Experience:
- Profound understanding of the theoretical foundations of machine learning
- Deep understanding of the performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimisations, dynamic batching, etc.)
- Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM)
- Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries
- Familiarity with containerized environments (e.g., Docker, Kubernetes)
- Strong communication skills and the ability to work independently
### Ways to Stand Out:
- Familiarity with modern LLM inference frameworks (vLLM, SGLang, TensorRT)
- Experience with Python and performance profiling tools (e.g., Nsight, nvprof, perf)
- Familiarity with cloud ML platforms such as AWS, GCP, and Azure ML
- Contributions to open-source ML benchmarking tools