Skill Profile

On-Call Management

PagerDuty, Opsgenie, escalation policies, runbooks, on-call rotation, SLA

Roles: 14 (roles that include this skill)
Levels: 5 (structured growth path)
Required: 30 (the remaining 38 are optional)
Domain: Observability & Monitoring
Group: Incident Management
Last updated: 2026/3/17

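How to use

Select your current level and compare it against the expectations. The cards below show what you need to master to advance.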

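Expectations by level

The tables below show how the expected depth of this skill changes from Junior to Principal, one table per level.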

Role Required Description
Backend Developer (Go) Understands on-call for Go services: responds to alerts, uses runbooks. Participates in incident response.
Backend Developer (Java/Kotlin) Understands on-call for Java: responds to alerts, thread dumps, heap dumps. Uses runbooks for troubleshooting.
Backend Developer (Python) Understands on-call for Python: responds to alerts, diagnoses exceptions. Uses runbooks.
Cloud Engineer Understands on-call basics for cloud infrastructure: alert triage procedures, runbook following for common incidents, and escalation paths. Participates in on-call rotations as secondary responder. Follows team practices for incident documentation and handoff procedures.
Database Engineer / DBA Participates in on-call rotation for the database tier: follows runbooks for alerts (high CPU, disk space, replication lag), escalates complex issues. Documents incidents and performs basic remediation actions.
DevOps Engineer Understands on-call principles: duty schedules, escalation, incident management. Participates in on-call under senior engineer guidance, responds to alerts following runbooks. Knows tools: PagerDuty, Opsgenie, VictorOps.
DevSecOps Engineer Participates in on-call rotation: responds to PagerDuty/Opsgenie alerts, follows runbooks for typical incidents. Documents actions and results. Understands escalation procedures. Studies basic security incidents: compromised credentials, suspicious login, certificate expiration. Maintains incident log.
Game Server Developer Understands on-call basics for game server infrastructure: player-impacting alert triage, game service health monitoring, and escalation procedures for live game issues. Participates in on-call rotations for game server operations. Follows team runbooks for common game server incidents.
Network Engineer Knows basic on-call management concepts for network engineering and can apply them in typical tasks. Uses standard tools and follows established team practices. Understands when and why this approach is used.
Platform Engineer Participates in on-call rotation for platform services: follows runbooks, escalates by procedure. Uses PagerDuty/Opsgenie for incident management. Documents incidents and actions in a timeline. Understands severity levels and the response time SLA for the platform.
Security Analyst Understands on-call basics for security operations: security alert triage procedures, SIEM dashboard monitoring, and security incident escalation paths. Participates in SOC rotations as junior analyst. Follows team procedures for security event investigation and documentation.
Site Reliability Engineer (SRE) Participates in on-call: follows escalation procedures, uses PagerDuty for alert management. Documents incidents. Hands off duty with handoff notes.
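
Several rows above mention following escalation paths and procedures. As a minimal sketch of what an escalation policy encodes, the snippet below models it as an ordered list of steps, each paged after a number of minutes without acknowledgement. The responder names, the delays, and the `who_to_page` helper are hypothetical and do not correspond to any PagerDuty or Opsgenie API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    responder: str      # who gets paged at this step (hypothetical names)
    after_minutes: int  # minutes without acknowledgement before paging


# Hypothetical three-step policy: primary, then secondary, then manager.
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 15),
    EscalationStep("engineering-manager", 30),
]


def who_to_page(policy, minutes_unacknowledged):
    """Return every responder who should have been paged by now, in order."""
    return [
        step.responder
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]


print(who_to_page(POLICY, 20))  # ['primary', 'secondary']
```

Acknowledging the alert stops the clock, so in practice the function would only be evaluated while the alert remains unacknowledged.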
Role Required Description
Backend Developer (Go) Manages on-call: creates runbooks for Go services, configures alerting, post-incident review.
Backend Developer (Java/Kotlin) Manages on-call: creates runbooks for JVM issues, configures alerting, GC troubleshooting guides.
Backend Developer (Python) Manages on-call: runbooks for Python services, alerting configuration, incident response.
Cloud Engineer Participates in on-call rotation for cloud infrastructure. Configures PagerDuty/Opsgenie with escalation, responds to CloudWatch and Prometheus alerts. Classifies incidents by severity, performs initial diagnostics: checking metrics, logs, cloud provider status pages.
Database Engineer / DBA Handles database incidents independently: diagnosing deadlocks, query performance degradation, replication breaks. Writes and updates runbooks. Conducts post-incident reviews for database-related incidents.
DevOps Engineer Configures on-call processes: schedules in PagerDuty/Opsgenie, escalation policies, alert routing rules. Creates runbooks for typical incidents, automates initial diagnostics. Conducts post-mortems and tracks action items.
DevSecOps Engineer Configures on-call processes for security incidents: escalation policies, severity classification, notification channels. Creates runbooks for security on-call: credential compromise, DDoS, data breach, ransomware. Integrates PagerDuty with SIEM alerts. Conducts weekly on-call review and trend analysis.
Engineering Manager Configures on-call management for team services: alert routing policies, rotation schedules with fair load distribution, and escalation chains. Creates dashboards for on-call health metrics (alert volume, MTTA, pages per rotation). Participates in incident response and conducts post-incident reviews.
Game Server Developer Configures on-call management for game servers: player impact-based alert prioritization, game-specific health dashboards (CCU, match completion rate, latency percentiles), and automated remediation for common game service issues. Analyzes incident patterns to reduce toil. Participates in live ops on-call rotations.
Network Engineer Confidently applies on-call management for network engineering in non-standard tasks. Independently selects the optimal approach and tools. Analyzes trade-offs and proposes improvements to existing solutions.
Platform Engineer Configures on-call infrastructure for the platform: PagerDuty service dependencies, escalation policies, schedule overrides. Creates and updates runbooks for common incidents. Implements automated diagnostics: auto-remediation scripts, diagnostic dashboards. Conducts incident reviews.
Security Analyst Configures on-call management for security operations: security alert correlation rules, threat severity-based triage automation, and SOC shift handoff procedures. Creates security monitoring dashboards with threat intelligence integration. Analyzes security incident patterns for detection rule improvement.
Site Reliability Engineer (SRE) Manages on-call process: configures PagerDuty schedules and escalation policies, writes runbooks for typical alerts. Analyzes on-call burden: toil, alert quality, false positive rate.
Technical Lead Configures on-call management for product services: service-level alerting aligned with SLOs, PagerDuty/Opsgenie integration, and on-call rotation optimization. Creates runbooks for common incidents and automates remediation where possible. Mentors team on incident response practices and blameless post-mortems.
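
The on-call health metrics named in the rows above (MTTA, alert quality, false positive rate) can be computed from a plain list of page records. The sketch below assumes a hypothetical record shape with `fired_at`, `acked_at`, and `actionable` fields; real paging tools export different schemas, and the sample data is made up for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def mtta_minutes(pages):
    """Mean time to acknowledge in minutes, ignoring unacknowledged pages."""
    deltas = [
        (p["acked_at"] - p["fired_at"]).total_seconds() / 60
        for p in pages
        if p["acked_at"] is not None
    ]
    return mean(deltas) if deltas else None


def false_positive_rate(pages):
    """Share of pages that required no action from the responder."""
    if not pages:
        return 0.0
    return sum(1 for p in pages if not p["actionable"]) / len(pages)


# Hypothetical sample: three pages during one night shift.
t0 = datetime(2026, 3, 1, 2, 0)
PAGES = [
    {"fired_at": t0, "acked_at": t0 + timedelta(minutes=4), "actionable": True},
    {"fired_at": t0 + timedelta(hours=1),
     "acked_at": t0 + timedelta(hours=1, minutes=10), "actionable": False},
    {"fired_at": t0 + timedelta(hours=2), "acked_at": None, "actionable": True},
]

print(mtta_minutes(PAGES))  # 7.0
```

Unacknowledged pages are excluded from MTTA here; a team might instead track them separately, since they usually indicate an escalation.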
Role Required Description
Backend Developer (Go) Designs on-call practices: escalation policies, automated remediation, incident response playbooks.
Backend Developer (Java/Kotlin) Designs on-call: automated JVM diagnostics, escalation policies, incident playbooks.
Backend Developer (Python) Designs on-call: automated diagnostics, escalation policies, incident playbooks.
Cloud Engineer Required Designs on-call processes for cloud team: alert routing by services and severity, runbooks for common incidents (disk full, OOM, AZ failure), post-incident review process. Optimizes alert noise: deduplication, suppression rules, actionable alerts. Reduces toil through automation.
Database Engineer / DBA Required Designs on-call processes for the DBA team: alert routing by severity, escalation policies, runbook automation. Mentors junior DBAs in incident response. Implements automated remediation for common database issues.
DevOps Engineer Required Designs incident management process: automated incident classification, PagerDuty integration with Slack/Jira/StatusPage. Implements incident commander role, automates communication through ChatOps. Configures SLO-based alerting to reduce alert fatigue.
DevSecOps Engineer Required Develops the corporate security incident management process: Incident Commander role, communication templates, stakeholder notification. Introduces automated triage through PagerDuty Event Intelligence. Creates tiered response: L1 (SOC), L2 (Security Engineering), L3 (Principal). Conducts GameDay exercises.
Engineering Manager Required Designs on-call management strategy for multiple team services: SLO-based alerting architecture, automated incident response workflows, and on-call sustainability practices (burnout prevention, fair rotation). Implements observability-driven incident detection. Conducts post-mortem reviews and drives systemic improvements. Defines MTTD/MTTR targets and tracks reliability metrics.
Game Server Developer Required Designs on-call management architecture for game server platform: live ops incident response automation, predictive alerting for player experience degradation, and cross-region incident coordination. Implements game-specific SLIs (match quality, connection stability). Conducts post-mortems focused on player impact analysis. Defines on-call sustainability practices for game operations teams.
Network Engineer Expertly applies on-call management for network engineering to design complex systems. Optimizes existing solutions and prevents architectural mistakes. Conducts code reviews and trains colleagues on best practices.
Platform Engineer Required Designs incident management system for IDP: automated incident creation, war room automation, stakeholder communication. Implements incident retrospective process with action item tracking. Creates self-healing automation for typical platform incidents (node failures, OOM).
Security Analyst Required Designs on-call management for security operations center: advanced threat detection automation, SOAR integration for incident response orchestration, and cross-functional security incident coordination. Implements distributed tracing for security event correlation. Defines security-specific SLIs (detection time, containment time). Conducts security post-mortems and drives detection engineering improvements.
Site Reliability Engineer (SRE) Required Optimizes on-call: alert tuning for noise reduction, automated remediation for common issues. Designs runbook automation. Analyzes on-call metrics and creates improvement plan.
Technical Lead Required Designs observability strategy with On-Call Management. Implements distributed tracing. Defines SLIs/SLOs. Conducts post-mortems.
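
SLO-based alerting, which several rows above cite as a way to reduce alert fatigue, is often expressed as an error-budget burn rate: the observed error ratio divided by the budget the SLO allows. The sketch below is one common formulation rather than a method prescribed by this profile. It pages only when both a long and a short window burn fast, so a problem that has already recovered does not wake anyone. The function names are hypothetical, and the default threshold of 14.4 is an illustrative value (roughly 2% of a 30-day budget spent in one hour).

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target,
                threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


# 99.9% SLO, 2% errors over the last hour and 3% over the last 5 minutes.
print(should_page(0.02, 0.03, slo_target=0.999))   # True
# Same hour, but the last 5 minutes are healthy again: no page.
print(should_page(0.02, 0.001, slo_target=0.999))  # False
```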
Role Required Description
Backend Developer (Go) Defines on-call standards for the Go team: rotation schedules, escalation policies, runbook requirements. Conducts incident post-mortems, improves alert quality and reduces alert fatigue.
Backend Developer (Java/Kotlin) Defines on-call standards for Java service platform: rotation policies aligned with team capacity, SLA-based escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Java service failures. Drives adoption of observability-driven incident management.
Backend Developer (Python) Defines on-call standards for Python service platform: rotation policies, SLA-driven escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Python service failures (memory leaks, GIL contention, dependency issues). Drives adoption of reliability engineering practices.
Cloud Engineer Required Defines on-call strategy for the cloud organization: follow-the-sun rotation, tier-1/tier-2 escalation, incident commander role. Introduces incident management process (ITIL/SRE), blameless postmortems, reliability metrics (MTTA, MTTR). Manages on-call load balancing and burnout prevention.
Database Engineer / DBA Required Defines on-call standards for the database tier: rotation schedule, coverage requirements, alert fatigue reduction. Coordinates cross-team incident response. Conducts on-call retrospectives and improves processes.
DevOps Engineer Required Defines organizational incident management strategy: severity level standards, escalation matrices, communication protocols. Designs blameless post-mortem process, MTTR/MTTA metrics, SRE on-call program with sustainable rotation.
DevSecOps Engineer Required Defines incident management strategy for the security organization. Manages SOC team with 24/7 coverage. Builds metrics: MTTA, MTTD, MTTR, false positive rate. Introduces post-incident review processes with actionable improvements. Coordinates with legal, PR, and management during major incidents.
Engineering Manager Required Defines on-call management strategy for product organization: SLO-based approach to incident management, on-call sustainability metrics, and cross-team incident coordination processes. Establishes post-mortem culture and tracks systemic improvements. Optimizes MTTD/MTTR across services.
Game Server Developer Required Defines on-call management strategy for game operations: live game incident management framework, player impact SLA definitions, and cross-studio incident coordination. Establishes post-mortem culture focused on player experience improvements. Drives adoption of proactive monitoring and automated remediation.
Network Engineer Establishes on-call management standards for the network engineering team and makes architectural decisions. Defines the technical roadmap incorporating this skill. Mentors senior engineers and influences practices of adjacent teams.
Platform Engineer Required Defines organizational incident management strategy: on-call expectations, compensation, burnout prevention. Leads MTTR improvement through automation and tooling. Designs cross-team incident coordination for complex incidents. Creates incident readiness program.
Security Analyst Required Defines on-call management strategy for security operations. Establishes SOC shift management, threat response SLA targets, and security incident escalation framework. Coordinates cross-team security incident response. Optimizes MTTD/MTTR for security events across the organization.
Site Reliability Engineer (SRE) Required Defines on-call standards: rotation policies, compensation, workload balance. Implements on-call metrics (interruptions, sleep impact). Builds sustainable on-call culture.
Technical Lead Required Defines on-call management strategy for the product. Establishes SLO-based alerting approach, incident management framework, and post-mortem process. Coordinates cross-team incident response. Optimizes MTTD/MTTR through observability improvements and automated remediation.
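
The follow-the-sun rotation mentioned in the rows above hands the pager between regions so that each team covers its own daytime. A minimal sketch, assuming three hypothetical regions with equal eight-hour shifts defined in UTC; the region names and shift boundaries are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# (start_hour_utc, end_hour_utc, region); hypothetical shift plan.
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def region_on_call(at):
    """Return the region holding the pager at a timezone-aware datetime."""
    hour = at.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"shift plan does not cover hour {hour} UTC")


# 10:00 in a UTC+9 timezone is 01:00 UTC, which falls in the APAC shift.
tokyo = timezone(timedelta(hours=9))
print(region_on_call(datetime(2026, 3, 17, 10, 0, tzinfo=tokyo)))  # APAC
```

Keeping the plan in UTC avoids ambiguity at handoff; the local-time conversion happens once, at lookup.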
Role Required Description
Backend Developer (Go) Shapes on-call strategy: platform incident management, automated response, governance.
Backend Developer (Java/Kotlin) Shapes incident management strategy for Java platform organization: platform-level on-call automation, automated incident response governance, and reliability culture. Drives adoption of SRE practices across Java service teams. Establishes enterprise-wide incident management standards.
Backend Developer (Python) Shapes incident management strategy for Python platform organization: platform-level on-call automation, automated incident response patterns, and reliability culture. Drives adoption of SRE practices across Python service teams. Establishes enterprise-wide incident management and observability standards.
Cloud Engineer Required Shapes enterprise-level incident management framework: unified incident process for multi-cloud, automated incident response (AWS Systems Manager, PagerDuty Rundeck), AIOps for anomaly detection. Defines organizational resilience strategy and chaos engineering program.
Database Engineer / DBA Required Shapes incident management strategy for the data platform: automated incident response, AI-assisted diagnostics, cross-database impact analysis. Defines on-call sustainability and investments in automation for database operations.
DevOps Engineer Required Develops corporate incident management culture: SRE principles, toil budgets, 80% incident automation. Defines AIOps platform architecture: ML-powered alert correlation, automated remediation, predictive incident prevention.
DevSecOps Engineer Required Architects the enterprise Incident Response and Cyber Resilience program. Defines SOC strategy: automation level, staffing model, tooling. Develops Business Continuity Plan considering cyber threats. Builds IR maturity metrics for board-level reporting. Influences security budget.
Engineering Manager Required Defines organizational observability strategy. Implements platform solutions. Builds reliability culture. Establishes enterprise SLO framework.
Game Server Developer Required Defines organizational reliability strategy for game infrastructure: enterprise SLO framework for live games, platform observability and incident management solutions, and reliability culture across game studios. Drives adoption of SRE practices for game operations at scale.
Network Engineer Shapes on-call management strategy for network engineering at the organizational level. Defines best practices and influences technology choices beyond their own team. Is a recognized expert in this area.
Platform Engineer Required Shapes operational excellence culture: blameless postmortems, learning from incidents, reliability as a feature. Defines AIOps strategy for automated incident response. Advises executives on investment in on-call tooling and reliability engineering for a sustainable platform.
Security Analyst Required Defines organizational strategy for security operations and incident management. Implements platform SOC solutions with AI-driven threat detection and automated response. Builds security reliability culture across the organization. Establishes enterprise SLO framework for security event handling.
Site Reliability Engineer (SRE) Required Designs organizational on-call model: follow-the-sun, tiered support, shared on-call between SRE and dev teams. Defines on-call governance and toil elimination strategy.
Technical Lead Required Defines the organization's observability strategy. Implements platform solutions. Shapes reliability culture. Defines enterprise SLO framework.
