Skill Profile

Model Serving

Triton, BentoML, Seldon: model deployment, A/B testing, canary releases


Roles: 7 (where this skill appears)
Levels: 5 (structured growth path)
Mandatory requirements: 25 (the other 10 optional)
Domain: Machine Learning & AI
Group: MLOps
Last updated: 3/17/2026

How to Use

Choose your current level and compare expectations. The items below show what to cover to advance to the next level.

What is Expected at Each Level

The tables below show how skill depth grows from Junior (Level 1) to Principal (Level 5).

Level 1 (Junior)

AI Product Engineer: Understands model serving basics for AI products: REST/gRPC inference endpoints, model versioning for A/B testing, and basic latency/throughput requirements. Follows team guidelines on integrating model predictions into product features. Understands the differences between batch and real-time inference.
Computer Vision Engineer: Understands model serving basics for CV systems: image/video inference pipeline setup, GPU resource allocation for inference, and model format conversion (ONNX, TensorRT). Follows team practices for deploying CV models to production endpoints.
Data Scientist: Understands model serving basics: exporting trained models (pickle, ONNX, SavedModel), basic API wrapper creation with Flask/FastAPI, and model input/output schema definition. Follows team practices for model packaging and deployment workflows.
LLM Engineer: Understands LLM serving basics: inference API setup (vLLM, TGI), prompt/completion endpoint configuration, and token-based billing considerations. Follows team practices for LLM deployment, including context window management and response streaming setup.
ML Engineer (Required): Deploys an ML model as a REST API via a web framework such as Flask. Understands the inference pipeline: preprocessing → prediction → postprocessing. Uses pickle/joblib for model serialization.
MLOps Engineer: Understands basic model serving concepts: the difference between batch and real-time inference, and the main model formats (ONNX, SavedModel, pickle). Can deploy a simple model via a Flask/FastAPI endpoint, load a model from a file, and return predictions. Knows about specialized serving systems such as TensorFlow Serving, Triton, and Seldon.
NLP Engineer (Required): Knows NLP model serving basics: REST API endpoints, model loading, batching. Deploys simple NLP models as REST APIs for text classification and NER tasks.
Level 2

AI Product Engineer: Implements model serving for AI product features: multi-model inference pipelines, feature store integration for real-time enrichment, and A/B testing infrastructure for model comparison. Configures auto-scaling based on inference load patterns. Implements model fallback strategies for high-availability product features.
Computer Vision Engineer: Implements CV model serving pipelines: batch and real-time inference with GPU optimization, model ensemble strategies for accuracy improvement, and pre/post-processing pipeline optimization. Configures TensorRT/ONNX Runtime for inference acceleration. Implements model caching and warm-up strategies for consistent latency.
Data Scientist: Implements model serving solutions: containerized model deployment (Docker/K8s), monitoring for data drift and prediction quality, and canary deployment for safe model rollout. Uses MLflow/BentoML for model packaging and serving. Implements feature engineering in the serving pipeline consistent with training.
LLM Engineer: Implements LLM serving solutions: KV-cache optimization for throughput, batching strategies (continuous batching, dynamic batching), and quantization for cost-efficient inference (GPTQ, AWQ, GGUF). Configures vLLM/TGI for production workloads. Implements streaming response infrastructure and token-level latency monitoring.
ML Engineer (Required): Uses model serving frameworks: Triton, BentoML, Seldon. Configures batch and real-time inference. Optimizes inference latency (ONNX, model optimization). Configures A/B testing for models.
MLOps Engineer: Deploys models to production via specialized serving platforms: TensorFlow Serving for TF models, Triton Inference Server for multi-framework serving. Configures BentoML for packaging models with dependencies, implements batch inference via Spark/Ray, and configures model versioning for seamless production model updates.
NLP Engineer (Required): Independently designs NLP model serving: TorchServe, Triton Inference Server. Configures batching, model versioning, A/B testing. Optimizes latency through model optimization.
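Several rows above mention batching strategies (dynamic and continuous batching). The core idea can be sketched in pure Python: queue incoming requests and flush either when the batch is full or when the oldest request has waited long enough. Production servers such as Triton or vLLM implement this natively; the class and parameter names here are illustrative.

```python
# Toy dynamic batcher: trade a small queuing delay (max_wait_s) for
# larger batches (up to max_batch), which raises GPU throughput.
import threading
import time
from queue import Empty, Queue


class DynamicBatcher:
    def __init__(self, predict_batch, max_batch=8, max_wait_s=0.01):
        self.predict_batch = predict_batch  # fn: list[x] -> list[y]
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, x):
        """Called per request; blocks until the batched result arrives."""
        slot = {"x": x, "done": threading.Event()}
        self.queue.put(slot)
        slot["done"].wait()
        return slot["y"]

    def _loop(self):
        while True:
            batch = [self.queue.get()]  # block until the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.queue.get(timeout=remaining))
                except Empty:
                    break  # deadline hit with a partial batch
            ys = self.predict_batch([s["x"] for s in batch])
            for slot, y in zip(batch, ys):
                slot["y"] = y
                slot["done"].set()
```

Continuous batching (as in vLLM) goes further by admitting new requests into a batch mid-generation, which this sketch does not attempt.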
Level 3

AI Product Engineer (Required): Designs model serving architecture for AI products: multi-model orchestration with routing logic, real-time feature computation for inference enrichment, and cost-optimized serving with tiered model selection. Implements serving observability: latency percentiles, prediction quality metrics, and cost-per-inference tracking. Creates model deployment governance for AI products. Mentors the team on production ML patterns.
Computer Vision Engineer (Required): Designs CV model serving architecture: edge-cloud hybrid inference for latency-critical applications, multi-GPU serving with dynamic batching, and model distillation pipelines for deployment optimization. Implements serving monitoring: inference latency, GPU utilization, and prediction accuracy tracking. Creates reference architectures for CV model deployment. Mentors the team on production CV system design.
Data Scientist (Required): Designs model serving architecture: scalable inference platforms, model registry integration with automated deployment, and online/offline feature consistency guarantees. Implements advanced monitoring: data drift detection, model performance degradation alerts, and automated retraining triggers. Creates serving best practices and model deployment standards. Mentors the team on MLOps patterns.
LLM Engineer (Required): Designs LLM serving architecture: multi-model gateway with intelligent routing, speculative decoding for latency optimization, and disaggregated serving (prefill/decode separation). Implements cost optimization: token budget management, caching layers for repeated prompts, and model cascade strategies. Creates LLM serving benchmarks and capacity planning models. Mentors the team on production LLM infrastructure.
ML Engineer (Required): Designs model serving architecture. Optimizes throughput (batching, GPU scheduling). Configures autoscaling for ML serving. Implements model fallback and canary deployment.
MLOps Engineer (Required): Architects model serving for complex scenarios: multi-model serving with dynamic loading, ensemble inference via Triton, and model A/B testing. Optimizes latency through model optimization (TensorRT, ONNX Runtime), implements GPU sharing for efficient resource utilization, and designs autoscaling based on inference metrics.
NLP Engineer (Required): Designs high-performance serving infrastructure for NLP models. Optimizes through quantization, distillation, and model parallelism. Ensures latency and throughput SLAs are met.
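The rows above repeatedly call for model fallback strategies. One common shape, sketched here under assumptions: run the primary model with a deadline and fall back to a cheaper model on timeout or error. The function name, pool size, and default timeout are illustrative, not from any particular serving framework.

```python
# Fallback inference sketch: primary model with a deadline, cheaper
# fallback on timeout or model-side failure.
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def predict_with_fallback(primary, fallback, x, timeout_s=0.05):
    """Return (prediction, source) where source is 'primary' or 'fallback'."""
    future = _pool.submit(primary, x)
    try:
        return future.result(timeout=timeout_s), "primary"
    except Exception:    # TimeoutError or an exception raised by the model
        future.cancel()  # best effort; an already-running call keeps going
        return fallback(x), "fallback"
```

Tracking the fraction of requests served by the fallback is a natural alerting signal: a sustained rise usually means the primary is overloaded or broken.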
Level 4

AI Product Engineer (Required): Defines the model serving strategy for the AI product platform. Establishes inference SLA targets, cost management frameworks, and model deployment governance. Conducts architecture reviews for AI serving infrastructure. Drives adoption of efficient model serving patterns across product teams.
Computer Vision Engineer (Required): Defines the model serving strategy for CV engineering teams. Establishes inference performance standards, GPU resource governance, and model deployment pipelines. Conducts architecture reviews for CV serving infrastructure. Drives adoption of optimized inference patterns for production CV systems.
Data Scientist (Required): Defines the model serving strategy for ML teams. Establishes model deployment standards, serving infrastructure requirements, and monitoring governance. Conducts reviews of serving architectures. Drives adoption of MLOps best practices for reliable model deployment across teams.
LLM Engineer (Required): Defines the LLM serving strategy for the organization. Establishes inference cost management policies, serving SLA targets, and GPU infrastructure governance. Evaluates serving frameworks (vLLM, TGI, TensorRT-LLM). Conducts architecture reviews for LLM infrastructure. Drives adoption of cost-efficient LLM serving patterns.
ML Engineer (Required): Defines the model serving strategy for the platform. Designs a unified serving layer. Optimizes serving costs. Coordinates with DevOps on infrastructure.
MLOps Engineer (Required): Defines the model serving strategy for the MLOps team: a standard stack (KServe/Seldon Core on Kubernetes) and deployment patterns (canary, shadow, blue-green). Implements a unified model rollout process with mandatory quality checks, configures SLA monitoring for latency, and defines runbooks for inference service incidents.
NLP Engineer (Required): Defines the model serving strategy for the NLP team. Establishes deployment standards, an SLA framework, and architectural decisions for scaling NLP inference infrastructure.
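The deployment patterns named above (canary, shadow, blue-green) all reduce to traffic-routing decisions. A weighted canary split can be sketched as below; in practice this usually lives in the serving platform (e.g. traffic splitting in KServe or a service mesh) rather than in application code, and the function and argument names here are illustrative.

```python
# Canary routing sketch: send a fixed fraction of traffic to the candidate
# model, the rest to the stable one. Shadow deployment differs in that the
# candidate gets a copy of every request but its responses are discarded;
# blue-green flips 100% of traffic between two full environments at once.
import random


def route(stable, canary, x, canary_fraction=0.05, rng=random):
    """Serve one request; return (prediction, variant) for later analysis."""
    if rng.random() < canary_fraction:
        return canary(x), "canary"
    return stable(x), "stable"
```

Logging the returned variant alongside quality metrics is what makes the canary useful: the rollout proceeds only if the candidate's metrics hold up at the small traffic share.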
Level 5 (Principal)

AI Product Engineer (Required): Defines the organizational AI serving strategy: inference platform architecture, model deployment governance, and AI infrastructure investment decisions. Evaluates build-vs-buy for serving infrastructure (self-hosted vs. managed services). Drives adoption of production ML excellence across the organization.
Computer Vision Engineer (Required): Defines the organizational strategy for CV model serving infrastructure: edge/cloud inference architecture, GPU fleet management, and hardware selection for CV workloads. Evaluates emerging inference technologies (custom ASICs, neuromorphic computing). Drives adoption of production CV excellence across the organization.
Data Scientist (Required): Defines the organizational ML serving strategy: inference platform standardization, model deployment governance, and the ML infrastructure investment roadmap. Evaluates emerging serving technologies and hardware. Drives adoption of production ML best practices across all data science teams.
LLM Engineer (Required): Defines the organizational LLM serving strategy: inference infrastructure architecture, GPU/TPU procurement strategy, and cost governance for LLM operations. Evaluates build-vs-buy decisions for LLM infrastructure (self-hosted vs. API providers). Drives adoption of efficient LLM serving practices and shapes the technical vision for AI infrastructure at enterprise scale.
ML Engineer (Required): Defines the enterprise model serving strategy. Evaluates serving technologies. Designs a multi-model serving platform.
MLOps Engineer (Required): Shapes the model serving strategy at the organizational level: a unified serving platform for all model types (CV, NLP, tabular) and SLA standards. Designs architecture for scaling to thousands of models (model mesh, serverless inference, edge deployment). Defines the GPU infrastructure cost optimization strategy for inference and the platform roadmap.
NLP Engineer (Required): Shapes the enterprise model serving strategy for the NLP platform. Defines inference infrastructure architecture, optimization standards, and cost management at the organizational level.
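Cost governance, which recurs in the rows above, usually starts from a back-of-envelope cost-per-request model. The sketch below is exactly that: every input (hourly GPU price, sustained request rate, utilization) is an illustrative assumption, and real capacity planning would add batching, autoscaling, and multi-replica effects.

```python
# Back-of-envelope serving cost model. All numbers fed into it are
# assumptions, not vendor pricing.
def cost_per_1k_requests(gpu_hourly_usd, peak_rps, utilization=0.6):
    """USD per 1,000 served requests for one GPU replica.

    utilization discounts peak throughput to the sustained average,
    since real traffic rarely keeps a replica saturated.
    """
    sustained_rps = peak_rps * utilization
    requests_per_hour = sustained_rps * 3600
    return gpu_hourly_usd / requests_per_hour * 1000.0
```

At an assumed $2/hour GPU serving 50 req/s at 50% utilization this gives about $0.022 per 1,000 requests; for LLM serving the same formula works per 1,000 tokens by substituting token throughput for request rate.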
