As a Machine Learning Engineer, you play a crucial role in building systems for training and deploying large-scale ML models across global operations. You collaborate with researchers, hardware experts, and software engineers on robust solutions that make full use of GPU acceleration, distributed computing, and open-source tools.
Core Responsibilities:
- Develop large-scale distributed training pipelines for large datasets and complex models
- Build and optimize low-latency inference pipelines for real-time predictions in production systems
- Develop libraries to improve ML framework performance
- Maximize training and inference performance with GPU hardware and acceleration libraries
- Design scalable model frameworks that deliver real-time, high-accuracy predictions on high-volume trading data
- Collaborate with quantitative researchers on automating ML experiments, hyperparameter tuning, and model retraining
- Partner with HPC specialists to optimize workflows, speed up training, and reduce costs
- Evaluate and roll out third-party tools for model development, training, and inference
- Dive deep into the internals of open-source ML tools to extend their capabilities and improve performance