Metrics & Monitoring
Domain: Observability & Monitoring
Skill profile: business KPIs, domain metrics, dashboards, alerting on business indicators
Roles where this skill appears: 7
Levels: 5 (a structured growth path)
Requirements: 21 mandatory, 14 optional
Last updated: 3/17/2026
Choose your current level and compare expectations: each level lists what to cover to advance to the next. The five tables below, one per level, show how skill depth grows from Junior to Principal.
Level 1 (Junior)

| Role | Required | Description |
|---|---|---|
| BI Analyst | Optional | Learns to define custom business metrics such as conversion rates, churn, and LTV using SQL and BI tools. Follows existing metric definitions and dashboards created by senior analysts. |
| Database Engineer / DBA | Optional | Understands how custom database metrics like query latency, replication lag, and connection pool usage are collected. Monitors predefined metric dashboards and escalates anomalies to senior engineers. |
| DevOps Engineer | Optional | Understands Prometheus metric types: counter, gauge, histogram, summary. Knows the difference between infrastructure and business metrics. Reads existing custom metrics and understands naming conventions (namespace_subsystem_name_unit). |
| MLOps Engineer | Optional | Learns to track custom ML metrics such as model drift, prediction latency, and data quality scores. Uses existing monitoring pipelines and Grafana dashboards to observe model performance. |
| Performance Testing Engineer | Optional | Creates custom performance metrics: business transaction latency, throughput per endpoint, error rate by type. Uses the Prometheus client for application metrics. |
| Platform Engineer | Optional | Instruments platform services with custom metrics: request duration, queue depth, cache hit ratio. Uses Prometheus client libraries to expose metrics. Understands naming conventions and label best practices. Creates dashboards for visualizing custom metrics. |
| Site Reliability Engineer (SRE) | Optional | Creates custom metrics: application counters and business KPIs through Prometheus client libraries. Understands RED metrics (Rate, Errors, Duration) and USE metrics (Utilization, Saturation, Errors). |
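The Junior rows above lean on the Prometheus metric types (counter, gauge, histogram) and the namespace_subsystem_name_unit naming convention. Below is a dependency-free Python sketch of those semantics; in practice you would use the official prometheus_client library, and all metric names here are hypothetical:

```python
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self, name, help_text):
        self.name, self.help, self.value = name, help_text, 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Value that can rise and fall, e.g. current queue depth."""
    def __init__(self, name, help_text):
        self.name, self.help, self.value = name, help_text, 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations in cumulative buckets, e.g. request latency."""
    def __init__(self, name, help_text, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.name, self.help, self.buckets = name, help_text, buckets
        self.counts = {b: 0 for b in buckets}
        self.count = 0

    def observe(self, value):
        self.count += 1
        for b in self.buckets:
            if value <= b:
                self.counts[b] += 1  # buckets are cumulative in Prometheus

# Naming convention: namespace_subsystem_name_unit (names are hypothetical)
requests = Counter("myapp_http_requests_total", "Total HTTP requests")
latency = Histogram("myapp_http_request_duration_seconds", "Request latency")
requests.inc()
latency.observe(0.3)  # counted in the 0.5, 1.0 and +Inf buckets, not 0.1
```

The cumulative-bucket behavior is what lets PromQL's histogram_quantile estimate percentiles later, and it is why a histogram with well-chosen buckets is preferred over a summary for aggregation across instances.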
Level 2

| Role | Required | Description |
|---|---|---|
| BI Analyst | Optional | Configures custom business metrics for services. Creates dashboards and alerts. Participates in the on-call rotation. Analyzes incidents. |
| Database Engineer / DBA | Optional | Configures custom database health metrics including slow query rates, index efficiency, and storage growth trends. Creates Grafana dashboards with alerts for replication lag and deadlock frequency. Participates in the on-call rotation. |
| DevOps Engineer | Optional | Develops custom DevOps metrics: CI/CD pipeline metrics (build duration, success rate), deployment frequency, and change failure rate. Creates exporters in Python/Go and instruments applications through client libraries. Configures recording rules. |
| MLOps Engineer | Optional | Configures custom metrics for ML pipelines, including training throughput, feature store freshness, and inference SLOs. Builds alerting rules for model degradation and data pipeline failures. Analyzes production incidents affecting model serving. |
| Performance Testing Engineer | Optional | Designs performance metrics: detailed latency breakdown (database, external calls, processing), resource efficiency metrics. Configures recording rules for aggregation. |
| Platform Engineer | Optional | Develops a standard metrics library for the platform: RED metrics (Rate, Errors, Duration) and USE metrics for infrastructure. Creates metric-generating middleware for automatic collection. Configures metric-based autoscaling (KEDA, custom HPA). Optimizes cardinality. |
| Site Reliability Engineer (SRE) | Optional | Designs custom metrics: SLI metrics for SLO tracking, detailed latency histograms, business metrics. Configures recording rules for aggregation. Creates alerting on custom metrics. |
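Several rows at this level mention recording rules and alerting on custom metrics. A sketch of a Prometheus rules file, assuming a service that exposes a hypothetical `myapp_http_requests_total` counter and `myapp_http_request_duration_seconds` histogram:

```yaml
groups:
  - name: myapp-recording
    rules:
      # Pre-aggregate the request rate per job (RED "Rate")
      - record: job:myapp_http_requests:rate5m
        expr: sum by (job) (rate(myapp_http_requests_total[5m]))
      # p99 latency derived from histogram buckets (RED "Duration")
      - record: job:myapp_http_request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum by (job, le) (rate(myapp_http_request_duration_seconds_bucket[5m])))
  - name: myapp-alerts
    rules:
      # Page when the 5xx ratio exceeds 5% for 10 minutes (RED "Errors")
      - alert: HighErrorRate
        expr: >
          sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(myapp_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx ratio above 5% for 10 minutes"
```

Recording rules pay the aggregation cost once at scrape-evaluation time instead of on every dashboard load, which is also how the cardinality of ad-hoc queries is kept in check.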
Level 3

| Role | Required | Description |
|---|---|---|
| BI Analyst | Required | Designs observability strategy with custom business metrics. Implements distributed tracing. Defines SLI/SLO. Conducts post-mortems. |
| Database Engineer / DBA | Required | Designs observability strategy with business metrics. Implements distributed tracing. Defines SLI/SLO. Conducts post-mortems. |
| DevOps Engineer | Required | Designs custom metrics system: DORA metrics for delivery performance evaluation, SLI metrics for each service, business KPIs in Prometheus. Implements OpenTelemetry Metrics, develops custom collectors for non-standard sources. |
| MLOps Engineer | Required | Architects observability strategy with business metrics. Implements distributed tracing. Defines SLIs/SLOs. Conducts post-mortems. |
| Performance Testing Engineer | Required | Defines performance metrics framework: standard instrumentation, custom metrics for bottleneck detection, derived metrics for analysis. Implements automated anomaly detection. |
| Platform Engineer | Required | Designs metrics strategy for IDP: business metrics pipeline, SLI/SLO automation through custom metrics. Implements metrics-as-code approach: Terraform for Grafana dashboards, alerts through GitOps. Creates self-service metrics onboarding for new services through Backstage. |
| Site Reliability Engineer (SRE) | Required | Defines metrics framework: standard instrumentation library, metric naming conventions, cardinality management. Implements derived metrics through recording rules. Integrates with SLO tooling. |
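Several rows at this level call for custom collectors and exporters for non-standard sources. A stdlib-only Python sketch of a minimal exporter that renders the Prometheus text exposition format; the build-system metrics are hypothetical placeholders:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(samples):
    """Render (name, help, type, value) tuples in the Prometheus text format."""
    lines = []
    for name, help_text, mtype, value in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

def collect():
    # Hypothetical non-standard source; replace with a real probe.
    return [
        ("buildsystem_queue_depth", "Jobs waiting in the build queue", "gauge", 3),
        ("buildsystem_builds_total", "Builds executed since start", "counter", 128),
    ]

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

def serve(port=9100):
    # Blocks forever; call from your entrypoint and scrape :<port>/metrics.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

A real exporter would use prometheus_client or the OpenTelemetry SDK rather than hand-rolling the format, but the scrape contract is exactly this: plain text over HTTP, refreshed by calling the collector on every scrape.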
Level 4

| Role | Required | Description |
|---|---|---|
| BI Analyst | Required | Defines product observability strategy. Establishes SLO-based approach. Coordinates incident management. Optimizes MTTD/MTTR. |
| Database Engineer / DBA | Required | Defines custom metrics standards for the database tier: business-level metrics (orders/sec, active users), database-specific (replication slot lag, vacuum progress). Coordinates custom metrics implementation between DBA and dev teams. |
| DevOps Engineer | Required | Defines organizational metrics standards: mandatory SLIs for each service tier, DORA metrics dashboard, FinOps metrics. Designs metrics platform with standard metrics catalog, self-service instrumentation and automated alerting. |
| MLOps Engineer | Required | Defines custom metrics standards for the MLOps team: ML-specific metrics (prediction_confidence, feature_freshness, model_staleness), business metrics tied to models. Implements a unified metrics library for inference services, standardizes labels and naming conventions for Prometheus, and configures composite alerts for ML system degradation. |
| Performance Testing Engineer | Required | Defines custom metrics standards for performance: mandatory instrumentation, naming conventions, cardinality budget. Implements a performance metrics catalog. |
| Platform Engineer | Required | Defines organizational metrics strategy: golden signals for each tier, cardinality budget, cost allocation. Leads observability standards adoption. Designs metric-driven decision framework: automated scaling, deployment decisions, capacity planning based on custom metrics. |
| Site Reliability Engineer (SRE) | Required | Defines SRE metrics standards: mandatory SLI metrics, dashboard templates, alerting best practices. Implements metric catalogs and automated cardinality monitoring. |
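Rows at this level standardize mandatory SLIs and SLO-based alerting. The arithmetic behind error budgets and burn rates is simple enough to sketch; the SLO target and request counts below are hypothetical:

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left for an event-based availability SLO."""
    allowed_bad = (1 - slo_target) * total_events  # the whole budget, in events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad if allowed_bad else 0.0

def burn_rate(slo_target, window_error_ratio):
    """How many times faster than sustainable the budget is being consumed."""
    return window_error_ratio / (1 - slo_target)

# 99.9% SLO over 1,000,000 requests with 400 failures:
# the budget is 1,000 bad events, 400 are used, so 60% remains.
remaining = error_budget_remaining(0.999, 999_600, 1_000_000)

# 0.2% errors in the last hour against a 0.1% allowance burns at 2x.
rate = burn_rate(0.999, 0.002)
```

Multi-window burn-rate alerting (e.g. paging only when both a short and a long window burn fast) is the usual refinement, since a single-window threshold is either too noisy or too slow.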
Level 5 (Principal)

| Role | Required | Description |
|---|---|---|
| BI Analyst | Required | Defines organizational observability strategy. Implements platform solutions. Builds reliability culture. Establishes enterprise SLO framework. |
| Database Engineer / DBA | Required | Shapes data platform metrics strategy: custom business metrics via database, cardinality management, metrics for capacity planning and cost attribution. Defines the framework for organizational database performance KPIs. |
| DevOps Engineer | Required | Develops metrics-driven operations strategy: ML-powered anomaly detection on custom metrics, predictive scaling, business observability. Defines unified metrics platform architecture for correlating infrastructure, application and business metrics. |
| MLOps Engineer | Required | Shapes the metrics strategy for the organization's MLOps platform: unified ML metrics taxonomy, standards for model health scoring and platform reliability. Designs automated model ROI calculation systems by linking ML metrics with business KPIs, defines cost-per-prediction metrics and composite health indicators for all production models. |
| Performance Testing Engineer | Required | Designs performance metrics strategy: organization-wide instrumentation standards, automated baseline calculation, ML-based anomaly detection. |
| Platform Engineer | Required | Shapes vision for data-driven platform: custom metrics + ML for predictive operations, anomaly detection, root cause analysis. Defines metric democratization strategy for business and tech teams. Evaluates real-time streaming analytics for next-gen observability platform. |
| Site Reliability Engineer (SRE) | Required | Designs organizational metrics framework: unified instrumentation SDK, metrics taxonomy, automated SLI generation. Defines observability cost management strategy. |
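Several Principal rows call for ML-based anomaly detection on metrics. A deliberately simple baseline is a rolling z-score; production systems layer seasonal or learned models on top of something like this. The latency series below is synthetic:

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=20, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from the
    mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Steady latency around 100 ms with a single spike at index 30.
latencies = [100 + (i % 3) for i in range(30)] + [180, 100, 101, 102]
spikes = zscore_anomalies(latencies)  # flags index 30 only
```

The known weaknesses of this baseline (it ignores seasonality, and the spike itself pollutes the following windows) are exactly what motivates the seasonal-decomposition and model-based detectors mentioned in the rows above.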