As a Senior AI Infrastructure Engineer at Together AI, you play a key role in building the next-generation AI cloud platform: a highly available, global, fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and provides ML practitioners with self-serve AI cloud services.
## Responsibilities
- Design, build, and maintain performant, secure, and highly available backend services/operators that run in data centers and automate hardware management, such as InfiniBand partitioning, in-datacenter parallel storage provisioning, and VM provisioning
- Design and build the IaaS software layer for a new GB200 data center with thousands of GPUs
- Work on a global multi-exabyte high-performance object store serving massive pretraining datasets
- Build advanced observability stacks for customers with automated node lifecycle management for fault-tolerant distributed pretraining
- Perform architecture and research work for decentralized AI workloads
- Work on the core open-source Together AI platform
- Create services, tools, and developer documentation
- Create testing frameworks for robustness and fault tolerance