What you'll do

About this role

# AI Model Serving & Inference Engineer

Join Tether's AI model team and drive innovation in model serving and inference architectures for advanced AI systems. You will optimize model deployment and inference strategies to deliver highly responsive, efficient, and scalable performance across real-world applications, working on systems ranging from resource-efficient models for limited hardware to complex multi-modal architectures.

## Responsibilities

- Design and deploy state-of-the-art model serving architectures that deliver high throughput and low latency while optimizing memory usage across diverse environments, including resource-constrained devices and edge platforms
- Build, run, and monitor controlled inference tests in simulated and live production environments, tracking key performance indicators such as response latency, throughput, memory consumption, and error rates
- Identify and prepare high-quality test datasets and simulation scenarios tailored to real-world deployment challenges on low-resource devices
- Analyze computational efficiency and diagnose bottlenecks in the serving pipeline by monitoring processing and memory metrics
- Work closely with cross-functional teams to integrate optimized serving and inference frameworks into production pipelines designed for edge and on-device applications

## Requirements

- Degree in Computer Science or a related field; ideally a PhD in NLP, Machine Learning, or a related field, with a solid track record in AI R&D and publications at top-tier conferences
- Knowledge of Metal Shading Language (MSL), with the ability to write custom compute shaders from scratch
- Proven experience in low-level kernel optimization and inference optimization on mobile devices
- Deep understanding of modern model serving architectures and inference optimization techniques
- Strong expertise in writing GPU kernels for mobile devices (smartphones) and a deep understanding of model serving frameworks
- Practical experience developing and deploying end-to-end inference pipelines on resource-constrained devices
- Knowledge of distributed inference systems, Diffusion Models, Vision Transformers, Pruning, Quantization, Flash Attention, KV Cache, and Speculative Decoding

Skills & experience

Senior, Metal Shading Language (MSL), GPU kernel optimization, Machine Learning, Model serving, Inference optimization, Mobile device optimization, Distributed Inference Systems, Tensor Parallelism, Pipeline Parallelism, Expert Parallelism, Diffusion Models, Vision Transformers, Pruning, Quantization, Flash Attention, KV Cache, Speculative Decoding
