dutchstartup.ai

Job opening

ML Infrastructure Engineer

Full-time · Posted 15 Jun 2026


What you will do

About this role

We are seeking a highly skilled ML/AI Engineer to lead and support the benchmarking of GPU platforms for machine learning and AI workloads. You will play a critical role in evaluating the performance of GPU-based hardware across deep learning and AI frameworks, enabling data-driven decisions for platform optimization and next-generation hardware development.

## Responsibilities

  • Work closely with hardware and development teams to profile and analyze GPU performance at the system and kernel level.
  • Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
  • Debug and optimize ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
  • Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
  • Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimizations on performance and scalability.
  • Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
  • Contribute to internal tooling, frameworks, and best practices.

## Requirements

  • A profound understanding of the theoretical foundations of machine learning.
  • Deep understanding of performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimizations, dynamic batching, etc.).
  • Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM).
  • Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries.
  • Familiarity with containerized environments (e.g., Docker, Kubernetes).
  • Strong communication skills and ability to work independently.

## Preferred Qualifications

  • Familiarity with modern LLM inference frameworks (vLLM, SGLang, TensorRT).
  • Experience with Python and performance profiling tools (e.g., Nsight, nvprof, perf).
  • Familiarity with cloud ML platforms such as AWS, GCP, and Azure ML.
  • Contributions to open-source ML benchmarking tools.

Skills & experience

Senior · PyTorch · JAX · Megatron-LM · TensorRT · CUDA · NCCL · ROCm · Docker · Kubernetes · Python · vLLM · SGLang · AWS · GCP · Azure ML · Nsight · nvprof · perf

Where you will work

About Nebius Group

Nebius Group, based in Amsterdam, is a technology company focused on delivering full-stack AI cloud infrastructure. The company offers GPU clusters, cloud platforms, and developer tools for managing the entire machine learning lifecycle, from data processing to fine-tuning and inference.


More at this company

More jobs at Nebius Group

  • Senior Software Engineer (Token Factory) · Full-time
  • Technical Product Manager - Soperator · Full-time
  • AI/ML Specialist Solutions Architect · Full-time
  • Staff / Principal Applied AI Researcher (Agentic Search) · Full-time
  • HPC System Engineer · Full-time
  • Senior ML Engineer (AI Research) · Full-time

Keep exploring

Similar jobs

  • Software Engineer, Data Infrastructure & Acquisition · Veldhoven · Full-time
  • AI Business Analyst · Veldhoven · Full-time
  • Lead Data Engineer · Full-time
  • AI Solutions Engineer · Nijmegen · Full-time
  • Senior Data Engineer Pricing · Full-time
  • Staff Officer (Data Scientist) - NATO 2030 · Full-time