Domain: Observability & Monitoring
Skill profile: PagerDuty, Opsgenie, escalation policies, runbooks, on-call rotation, SLA
Roles that include this skill: 14
Levels (structured growth path): 5
Mandatory requirements: 30 (the remaining 38 are optional)
Observability & Monitoring
Incident Management
2026/3/17
The tables below show how the depth of this skill changes from junior to principal, one table per level.

Level 1 of 5 (junior)

| Role | Requirement | Description |
|---|---|---|
| Backend Developer (Go) | Optional | Understands on-call for Go services: responds to alerts and uses runbooks. Participates in incident response. |
| Backend Developer (Java/Kotlin) | Optional | Understands on-call for Java services: responds to alerts, collects thread dumps and heap dumps. Uses runbooks for troubleshooting. |
| Backend Developer (Python) | Optional | Understands on-call for Python services: responds to alerts and diagnoses exceptions. Uses runbooks. |
| Cloud Engineer | Optional | Understands on-call basics for cloud infrastructure: alert triage procedures, runbook following for common incidents, and escalation paths. Participates in on-call rotations as secondary responder. Follows team practices for incident documentation and handoff procedures. |
| Database Engineer / DBA | Optional | Participates in on-call rotation for the database tier: follows runbooks for alerts (high CPU, disk space, replication lag) and escalates complex issues. Documents incidents and performs basic remediation actions. |
| DevOps Engineer | Optional | Understands on-call principles: duty schedules, escalation, incident management. Participates in on-call under senior engineer guidance and responds to alerts following runbooks. Knows the tools: PagerDuty, Opsgenie, VictorOps. |
| DevSecOps Engineer | Optional | Participates in on-call rotation: responds to PagerDuty/Opsgenie alerts and follows runbooks for typical incidents. Documents actions and results. Understands escalation procedures. Studies basic security incidents: compromised credentials, suspicious logins, certificate expiration. Maintains the incident log. |
| Game Server Developer | Optional | Understands on-call basics for game server infrastructure: player-impacting alert triage, game service health monitoring, and escalation procedures for live game issues. Participates in on-call rotations for game server operations. Follows team runbooks for common game server incidents. |
| Network Engineer | Optional | Knows basic on-call management concepts for network engineering and applies them in typical tasks. Uses standard tools and follows established team practices. Understands when and why these practices apply. |
| Platform Engineer | Optional | Participates in on-call rotation for platform services: follows runbooks and escalates by procedure. Uses PagerDuty/Opsgenie for incident management. Documents incidents and actions in a timeline. Understands severity levels and the response-time SLA for the platform. |
| Security Analyst | Optional | Understands on-call basics for security operations: security alert triage procedures, SIEM dashboard monitoring, and security incident escalation paths. Participates in SOC rotations as a junior analyst. Follows team procedures for security event investigation and documentation. |
| Site Reliability Engineer (SRE) | Optional | Participates in on-call: follows escalation procedures and uses PagerDuty for alert management. Documents incidents. Hands off duty with handoff notes. |
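
Several rows above mention following escalation procedures. As a rough illustration of what such a procedure encodes, the sketch below walks a time-based escalation policy and reports who should be holding an unacknowledged page. The `EscalationStep` type, the `current_responder` function, and the responder names and timeouts in `POLICY` are all hypothetical; this is not the PagerDuty or Opsgenie data model, only a minimal model of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    responder: str    # who is paged at this step (hypothetical names below)
    timeout_min: int  # minutes to wait for an acknowledgement before escalating


def current_responder(policy, minutes_unacked):
    """Return who should hold the page after `minutes_unacked` minutes without an ack."""
    if not policy:
        raise ValueError("escalation policy must have at least one step")
    elapsed = 0
    for step in policy:
        elapsed += step.timeout_min
        if minutes_unacked < elapsed:
            return step.responder
    # Past the final timeout: in this sketch the page stays with the last step.
    return policy[-1].responder


# Assumed example policy; real timeouts are a team decision.
POLICY = [
    EscalationStep("primary on-call", 15),
    EscalationStep("secondary on-call", 15),
    EscalationStep("engineering manager", 30),
]

if __name__ == "__main__":
    for minute in (0, 15, 30):
        print(minute, current_responder(POLICY, minute))
```

Keeping the policy as plain data makes it easy to review in a runbook alongside the alert it applies to.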

Level 2 of 5

| Role | Requirement | Description |
|---|---|---|
| Backend Developer (Go) | Optional | Manages on-call for Go services: creates runbooks, configures alerting, runs post-incident reviews. |
| Backend Developer (Java/Kotlin) | Optional | Manages on-call: creates runbooks for JVM issues, configures alerting, writes GC troubleshooting guides. |
| Backend Developer (Python) | Optional | Manages on-call: writes runbooks for Python services, configures alerting, handles incident response. |
| Cloud Engineer | Optional | Participates in on-call rotation for cloud infrastructure. Configures PagerDuty/Opsgenie with escalation, responds to CloudWatch and Prometheus alerts. Classifies incidents by severity and performs initial diagnostics: checking metrics, logs, and cloud provider status pages. |
| Database Engineer / DBA | Optional | Handles database incidents independently: diagnoses deadlocks, query performance degradation, and replication breaks. Writes and updates runbooks. Conducts post-incident reviews for database-related incidents. |
| DevOps Engineer | Optional | Configures on-call processes: schedules in PagerDuty/Opsgenie, escalation policies, alert routing rules. Creates runbooks for typical incidents and automates initial diagnostics. Conducts post-mortems and tracks action items. |
| DevSecOps Engineer | Optional | Configures on-call processes for security incidents: escalation policies, severity classification, notification channels. Creates runbooks for security on-call: credential compromise, DDoS, data breach, ransomware. Integrates PagerDuty with SIEM alerts. Conducts weekly on-call reviews and trend analysis. |
| Engineering Manager | Optional | Configures on-call management for team services: alert routing policies, rotation schedules with fair load distribution, and escalation chains. Creates dashboards for on-call health metrics (alert volume, MTTA, pages per rotation). Participates in incident response and conducts post-incident reviews. |
| Game Server Developer | Optional | Configures on-call management for game servers: player impact-based alert prioritization, game-specific health dashboards (CCU, match completion rate, latency percentiles), and automated remediation for common game service issues. Analyzes incident patterns to reduce toil. Participates in live ops on-call rotations. |
| Network Engineer | Optional | Confidently applies on-call management for network engineering in non-standard tasks. Independently selects the appropriate approach and tools. Analyzes trade-offs and proposes improvements to existing solutions. |
| Platform Engineer | Optional | Configures on-call infrastructure for the platform: PagerDuty service dependencies, escalation policies, schedule overrides. Creates and updates runbooks for common incidents. Implements automated diagnostics: auto-remediation scripts, diagnostic dashboards. Conducts incident reviews. |
| Security Analyst | Optional | Configures on-call management for security operations: security alert correlation rules, threat severity-based triage automation, and SOC shift handoff procedures. Creates security monitoring dashboards with threat intelligence integration. Analyzes security incident patterns to improve detection rules. |
| Site Reliability Engineer (SRE) | Optional | Manages the on-call process: configures PagerDuty schedules and escalation policies, writes runbooks for typical alerts. Analyzes on-call burden: toil, alert quality, false positive rate. |
| Technical Lead | Optional | Configures on-call management for product services: service-level alerting aligned with SLOs, PagerDuty/Opsgenie integration, and on-call rotation optimization. Creates runbooks for common incidents and automates remediation where possible. Mentors the team on incident response practices and blameless post-mortems. |
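
The on-call health metrics named in the table above (MTTA, pages per rotation) reduce to simple arithmetic over a page log. The sketch below assumes a hypothetical record shape, a dict with `responder`, `triggered_at`, and `acked_at` keys, and invented sample data in `PAGES`; a real export from a paging tool will look different and would need mapping into this form.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean


def mtta_minutes(pages):
    """Mean time to acknowledge, in minutes, over the pages that were acknowledged."""
    deltas = [
        (p["acked_at"] - p["triggered_at"]).total_seconds() / 60
        for p in pages
        if p["acked_at"] is not None
    ]
    # No acknowledged pages means the metric is undefined, not zero.
    return mean(deltas) if deltas else None


def pages_per_responder(pages):
    """Count pages per responder, a rough signal for uneven on-call load."""
    return Counter(p["responder"] for p in pages)


# Invented sample log: two acknowledged pages and one that was never acknowledged.
T0 = datetime(2026, 3, 1, 2, 0)
PAGES = [
    {"responder": "alice", "triggered_at": T0, "acked_at": T0 + timedelta(minutes=4)},
    {"responder": "bob", "triggered_at": T0, "acked_at": T0 + timedelta(minutes=10)},
    {"responder": "alice", "triggered_at": T0, "acked_at": None},
]

if __name__ == "__main__":
    print("MTTA (min):", mtta_minutes(PAGES))
    print("Pages per responder:", dict(pages_per_responder(PAGES)))
```

Unacknowledged pages are excluded from MTTA here; whether to exclude them or count them at a cap is a reporting choice worth stating on the dashboard.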

Level 3 of 5

| Role | Requirement | Description |
|---|---|---|
| Backend Developer (Go) | Optional | Designs on-call practices: escalation policies, automated remediation, incident response playbooks. |
| Backend Developer (Java/Kotlin) | Optional | Designs on-call practices: automated JVM diagnostics, escalation policies, incident playbooks. |
| Backend Developer (Python) | Optional | Designs on-call practices: automated diagnostics, escalation policies, incident playbooks. |
| Cloud Engineer | Required | Designs on-call processes for the cloud team: alert routing by service and severity, runbooks for common incidents (disk full, OOM, AZ failure), post-incident review process. Reduces alert noise through deduplication, suppression rules, and actionable alerts. Reduces toil through automation. |
| Database Engineer / DBA | Required | Designs on-call processes for the DBA team: alert routing by severity, escalation policies, runbook automation. Mentors junior DBAs in incident response. Implements automated remediation for common database issues. |
| DevOps Engineer | Required | Designs the incident management process: automated incident classification, PagerDuty integration with Slack/Jira/StatusPage. Introduces the incident commander role and automates communication through ChatOps. Configures SLO-based alerting to reduce alert fatigue. |
| DevSecOps Engineer | Required | Develops the corporate security incident management process: Incident Commander role, communication templates, stakeholder notification. Introduces automated triage through PagerDuty Event Intelligence. Creates a tiered response: L1 (SOC), L2 (Security Engineering), L3 (Principal). Conducts GameDay exercises. |
| Engineering Manager | Required | Designs on-call management strategy for the services of multiple teams: SLO-based alerting architecture, automated incident response workflows, and on-call sustainability practices (burnout prevention, fair rotation). Implements observability-driven incident detection. Conducts post-mortem reviews and drives systemic improvements. Defines MTTD/MTTR targets and tracks reliability metrics. |
| Game Server Developer | Required | Designs on-call management architecture for the game server platform: live ops incident response automation, predictive alerting for player experience degradation, and cross-region incident coordination. Implements game-specific SLIs (match quality, connection stability). Conducts post-mortems focused on player impact analysis. Defines on-call sustainability practices for game operations teams. |
| Network Engineer | Optional | Expertly applies on-call management for network engineering to design complex systems. Optimizes existing solutions and prevents architectural mistakes. Conducts code reviews and trains colleagues on best practices. |
| Platform Engineer | Required | Designs the incident management system for the IDP: automated incident creation, war room automation, stakeholder communication. Implements an incident retrospective process with action item tracking. Creates self-healing automation for typical platform incidents (node failures, OOM). |
| Security Analyst | Required | Designs on-call management for the security operations center: advanced threat detection automation, SOAR integration for incident response orchestration, and cross-functional security incident coordination. Implements distributed tracing for security event correlation. Defines security-specific SLIs (detection time, containment time). Conducts security post-mortems and drives detection engineering improvements. |
| Site Reliability Engineer (SRE) | Required | Optimizes on-call: alert tuning for noise reduction, automated remediation for common issues. Designs runbook automation. Analyzes on-call metrics and creates an improvement plan. |
| Technical Lead | Required | Designs an observability strategy that incorporates on-call management. Implements distributed tracing. Defines SLIs/SLOs. Conducts post-mortems. |
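
SLO-based alerting, which several rows above cite as a way to reduce alert fatigue, is commonly expressed as an error-budget burn rate: the observed error ratio divided by the budget the SLO allows. The sketch below pages only when both a long and a short window exceed a threshold, so a brief spike or an already-recovered incident does not wake anyone. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from a commonly cited 1-hour/5-minute pairing, not a value taken from this profile.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / error_budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn faster than `threshold` (an assumed default)."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # 2% errors against a 99.9% SLO burns the budget roughly 20x too fast.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The long window is still hot but the short window has recovered: no page.
    print(should_page(0.02, 0.001, slo_target=0.999))
```

The window lengths themselves are not modeled here; the caller is assumed to supply error ratios already aggregated over the long and short windows.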

Level 4 of 5

| Role | Requirement | Description |
|---|---|---|
| Backend Developer (Go) | Optional | Defines on-call standards for the Go team: rotation schedules, escalation policies, runbook requirements. Conducts incident post-mortems, improves alert quality, and reduces alert fatigue. |
| Backend Developer (Java/Kotlin) | Optional | Defines on-call standards for the Java service platform: rotation policies aligned with team capacity, SLA-based escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Java service failures. Drives adoption of observability-driven incident management. |
| Backend Developer (Python) | Optional | Defines on-call standards for the Python service platform: rotation policies, SLA-driven escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Python service failures (memory leaks, GIL contention, dependency issues). Drives adoption of reliability engineering practices. |
| Cloud Engineer | Required | Defines on-call strategy for the cloud organization: follow-the-sun rotation, tier-1/tier-2 escalation, incident commander role. Introduces an incident management process (ITIL/SRE), blameless post-mortems, and reliability metrics (MTTA, MTTR). Manages on-call load balancing and burnout prevention. |
| Database Engineer / DBA | Required | Defines on-call standards for the database tier: rotation schedule, coverage requirements, alert fatigue reduction. Coordinates cross-team incident response. Conducts on-call retrospectives and improves processes. |
| DevOps Engineer | Required | Defines organizational incident management strategy: severity level standards, escalation matrices, communication protocols. Designs the blameless post-mortem process, MTTR/MTTA metrics, and an SRE on-call program with sustainable rotation. |
| DevSecOps Engineer | Required | Defines incident management strategy for the security organization. Manages a SOC team with 24/7 coverage. Builds metrics: MTTA, MTTD, MTTR, false positive rate. Introduces post-incident review processes with actionable improvements. Coordinates with legal, PR, and management during major incidents. |
| Engineering Manager | Required | Defines on-call management strategy for the product organization: SLO-based approach to incident management, on-call sustainability metrics, and cross-team incident coordination processes. Establishes a post-mortem culture and tracks systemic improvements. Optimizes MTTD/MTTR across services. |
| Game Server Developer | Required | Defines on-call management strategy for game operations: live game incident management framework, player impact SLA definitions, and cross-studio incident coordination. Establishes a post-mortem culture focused on player experience improvements. Drives adoption of proactive monitoring and automated remediation. |
| Network Engineer | Optional | Establishes on-call management standards for the network engineering team and makes architectural decisions. Defines the technical roadmap incorporating this skill. Mentors senior engineers and influences the practices of adjacent teams. |
| Platform Engineer | Required | Defines organizational incident management strategy: on-call expectations, compensation, burnout prevention. Leads MTTR improvement through automation and tooling. Designs cross-team incident coordination for complex incidents. Creates an incident readiness program. |
| Security Analyst | Required | Defines on-call management strategy for security operations. Establishes SOC shift management, threat response SLA targets, and a security incident escalation framework. Coordinates cross-team security incident response. Optimizes MTTD/MTTR for security events across the organization. |
| Site Reliability Engineer (SRE) | Required | Defines on-call standards: rotation policies, compensation, workload balance. Implements on-call metrics (interruptions, sleep impact). Builds a sustainable on-call culture. |
| Technical Lead | Required | Defines on-call management strategy for the product. Establishes an SLO-based alerting approach, incident management framework, and post-mortem process. Coordinates cross-team incident response. Optimizes MTTD/MTTR through observability improvements and automated remediation. |
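
The follow-the-sun rotation mentioned in the table above hands the pager to whichever region is in working hours, so nobody is routinely paged at night. A minimal sketch, assuming three hypothetical regions that split the UTC day into equal eight-hour shifts; the region names, the boundaries in `SHIFTS`, and the `region_on_call` function are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shift table: (region, start hour, end hour) in UTC, end exclusive.
SHIFTS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def region_on_call(now):
    """Return the region whose shift covers the timezone-aware datetime `now`."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    hour = now.astimezone(timezone.utc).hour
    for region, start, end in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"shift table does not cover hour {hour}")


if __name__ == "__main__":
    print(region_on_call(datetime(2026, 3, 17, 3, 0, tzinfo=timezone.utc)))
    # 01:00 at UTC-5 is 06:00 UTC, which still falls in the first shift.
    print(region_on_call(datetime(2026, 3, 17, 1, 0, tzinfo=timezone(timedelta(hours=-5)))))
```

Converting to UTC before the lookup keeps the shift table independent of where the caller runs, which matters once responders span several time zones.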

Level 5 of 5 (principal)

| Role | Requirement | Description |
|---|---|---|
| Backend Developer (Go) | Optional | Shapes on-call strategy: platform incident management, automated response, governance. |
| Backend Developer (Java/Kotlin) | Optional | Shapes incident management strategy for the Java platform organization: platform-level on-call automation, automated incident response governance, and reliability culture. Drives adoption of SRE practices across Java service teams. Establishes enterprise-wide incident management standards. |
| Backend Developer (Python) | Optional | Shapes incident management strategy for the Python platform organization: platform-level on-call automation, automated incident response patterns, and reliability culture. Drives adoption of SRE practices across Python service teams. Establishes enterprise-wide incident management and observability standards. |
| Cloud Engineer | Required | Shapes the enterprise-level incident management framework: unified incident process for multi-cloud, automated incident response (AWS Systems Manager, PagerDuty Rundeck), AIOps for anomaly detection. Defines organizational resilience strategy and the chaos engineering program. |
| Database Engineer / DBA | Required | Shapes incident management strategy for the data platform: automated incident response, AI-assisted diagnostics, cross-database impact analysis. Defines on-call sustainability and investments in automation for database operations. |
| DevOps Engineer | Required | Develops the corporate incident management culture: SRE principles, toil budgets, 80% incident automation. Defines AIOps platform architecture: ML-powered alert correlation, automated remediation, predictive incident prevention. |
| DevSecOps Engineer | Required | Designs the architecture of the enterprise Incident Response and Cyber Resilience program. Defines SOC strategy: automation level, staffing model, tooling. Develops a Business Continuity Plan that accounts for cyber threats. Builds IR maturity metrics for board-level reporting. Influences the security budget. |
| Engineering Manager | Required | Defines the organizational observability strategy. Implements platform solutions. Builds a reliability culture. Establishes the enterprise SLO framework. |
| Game Server Developer | Required | Defines organizational reliability strategy for game infrastructure: enterprise SLO framework for live games, platform observability and incident management solutions, and reliability culture across game studios. Drives adoption of SRE practices for game operations at scale. |
| Network Engineer | Optional | Shapes on-call management strategy for network engineering at the organizational level. Defines best practices and influences technology choices beyond their own team. Is a recognized expert in this area. |
| Platform Engineer | Required | Shapes operational excellence culture: blameless post-mortems, learning from incidents, reliability as a feature. Defines AIOps strategy for automated incident response. Advises executives on investment in on-call tooling and reliability engineering for a sustainable platform. |
| Security Analyst | Required | Defines organizational strategy for security operations and incident management. Implements platform SOC solutions with AI-driven threat detection and automated response. Builds a security reliability culture across the organization. Establishes the enterprise SLO framework for security event handling. |
| Site Reliability Engineer (SRE) | Required | Designs the organizational on-call model: follow-the-sun, tiered support, shared on-call between SRE and dev teams. Defines on-call governance and toil elimination strategy. |
| Technical Lead | Required | Defines the organization's observability strategy. Implements platform solutions. Shapes the reliability culture. Defines the enterprise SLO framework. |