Skill Profile

On-Call Management

PagerDuty, OpsGenie, escalation policies, runbooks, on-call rotation, SLA


Roles: 14 (roles that include this skill)

Levels: 5 (structured growth path)

Required requirements: 30 (the remaining 38 are optional)

Domain: Observability & Monitoring

Group: Incident Management

Last updated: 2026-03-17

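How to use

Select your current level and compare it against the expectations. The cards below show what you need to master for promotion.

Expectations by level

The tables show how the depth of this skill changes from Junior to Principal.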

Junior

Role	Requirement	Description
Backend Developer (Go)		Understands on-call for Go services: responds to alerts, uses runbooks. Participates in incident response.
Backend Developer (Java/Kotlin)		Understands on-call for Java services: responds to alerts, captures thread dumps and heap dumps. Uses runbooks for troubleshooting.
Backend Developer (Python)		Understands on-call for Python: responds to alerts, diagnoses exceptions. Uses runbooks.
Cloud Engineer		Understands on-call basics for cloud infrastructure: alert triage procedures, runbook following for common incidents, and escalation paths. Participates in on-call rotations as secondary responder. Follows team practices for incident documentation and handoff procedures.
Database Engineer / DBA		Participates in on-call rotation for the database tier: follows runbooks for alerts (high CPU, disk space, replication lag), escalates complex issues. Documents incidents and performs basic remediation actions.
DevOps Engineer		Understands on-call principles: duty schedules, escalation, incident management. Participates in on-call under senior engineer guidance, responds to alerts following runbooks. Knows tools: PagerDuty, Opsgenie, VictorOps.
DevSecOps Engineer		Participates in on-call rotation: responds to PagerDuty/OpsGenie alerts, follows runbooks for typical incidents. Documents actions and results. Understands escalation procedures. Studies basic security incidents: compromised credentials, suspicious login, certificate expiration. Maintains incident log.
Game Server Developer		Understands on-call basics for game server infrastructure: player-impacting alert triage, game service health monitoring, and escalation procedures for live game issues. Participates in on-call rotations for game server operations. Follows team runbooks for common game server incidents.
Network Engineer		Knows basic on-call management concepts for network engineering and can apply them in typical tasks. Uses standard tools and follows established team practices. Understands when and why this approach is used.
Platform Engineer		Participates in on-call rotation for platform services: follows runbooks, escalates by procedure. Uses PagerDuty/OpsGenie for incident management. Documents incidents and actions in timeline. Understands severity levels and response time SLA for the platform.
Security Analyst		Understands on-call basics for security operations: security alert triage procedures, SIEM dashboard monitoring, and security incident escalation paths. Participates in SOC rotations as junior analyst. Follows team procedures for security event investigation and documentation.
Site Reliability Engineer (SRE)		Participates in on-call: follows escalation procedures, uses PagerDuty for alert management. Documents incidents. Hands off duty with handoff notes.

Middle

Role	Requirement	Description
Backend Developer (Go)		Manages on-call: creates runbooks for Go services, configures alerting, post-incident review.
Backend Developer (Java/Kotlin)		Manages on-call: creates runbooks for JVM issues, configures alerting, GC troubleshooting guides.
Backend Developer (Python)		Manages on-call: runbooks for Python services, alerting configuration, incident response.
Cloud Engineer		Participates in on-call rotation for cloud infrastructure. Configures PagerDuty/OpsGenie with escalation, responds to CloudWatch and Prometheus alerts. Classifies incidents by severity, performs initial diagnostics — checking metrics, logs, cloud provider status pages.
Database Engineer / DBA		Handles database incidents independently: diagnosing deadlocks, query performance degradation, replication breaks. Writes and updates runbooks. Conducts post-incident reviews for database-related incidents.
DevOps Engineer		Configures on-call processes: schedules in PagerDuty/Opsgenie, escalation policies, alert routing rules. Creates runbooks for typical incidents, automates initial diagnostics. Conducts post-mortems and tracks action items.
DevSecOps Engineer		Configures on-call processes for security incidents: escalation policies, severity classification, notification channels. Creates runbooks for security on-call: credential compromise, DDoS, data breach, ransomware. Integrates PagerDuty with SIEM alerts. Conducts weekly on-call review and trend analysis.
Engineering Manager		Configures on-call management for team services: alert routing policies, rotation schedules with fair load distribution, and escalation chains. Creates dashboards for on-call health metrics (alert volume, MTTA, pages per rotation). Participates in incident response and conducts post-incident reviews.
Game Server Developer		Configures on-call management for game servers: player impact-based alert prioritization, game-specific health dashboards (CCU, match completion rate, latency percentiles), and automated remediation for common game service issues. Analyzes incident patterns to reduce toil. Participates in live ops on-call rotations.
Network Engineer		Confidently applies on-call management for network engineering in non-standard tasks. Independently selects the optimal approach and tools. Analyzes trade-offs and proposes improvements to existing solutions.
Platform Engineer		Configures on-call infrastructure for the platform: PagerDuty service dependencies, escalation policies, schedule overrides. Creates and updates runbooks for common incidents. Implements automated diagnostics: auto-remediation scripts, diagnostic dashboards. Conducts incident reviews.
Security Analyst		Configures on-call management for security operations: security alert correlation rules, threat severity-based triage automation, and SOC shift handoff procedures. Creates security monitoring dashboards with threat intelligence integration. Analyzes security incident patterns for detection rule improvement.
Site Reliability Engineer (SRE)		Manages on-call process: configures PagerDuty schedules and escalation policies, writes runbooks for typical alerts. Analyzes on-call burden: toil, alert quality, false positive rate.
Technical Lead		Configures on-call management for product services: service-level alerting aligned with SLOs, PagerDuty/OpsGenie integration, and on-call rotation optimization. Creates runbooks for common incidents and automates remediation where possible. Mentors team on incident response practices and blameless post-mortems.

Senior

Role	Requirement	Description
Backend Developer (Go)		Designs on-call practices: escalation policies, automated remediation, incident response playbooks.
Backend Developer (Java/Kotlin)		Designs on-call: automated JVM diagnostics, escalation policies, incident playbooks.
Backend Developer (Python)		Designs on-call: automated diagnostics, escalation policies, incident playbooks.
Cloud Engineer	Required	Designs on-call processes for the cloud team: alert routing by service and severity, runbooks for common incidents (disk full, OOM, AZ failure), post-incident review process. Optimizes alerting to reduce noise: deduplication, suppression rules, actionable alerts. Reduces toil through automation.
Database Engineer / DBA	Required	Designs on-call processes for the DBA team: alert routing by severity, escalation policies, runbook automation. Mentors junior DBAs in incident response. Implements automated remediation for common database issues.
DevOps Engineer	Required	Designs incident management process: automated incident classification, PagerDuty integration with Slack/Jira/StatusPage. Implements incident commander role, automates communication through ChatOps. Configures SLO-based alerting to reduce alert fatigue.
DevSecOps Engineer	Required	Develops corporate security Incident Management process: Incident Commander role, communication templates, stakeholder notification. Introduces automated triage through PagerDuty Event Intelligence. Creates tiered response: L1 (SOC), L2 (Security Engineering), L3 (Principal). Conducts GameDay exercises.
Engineering Manager	Required	Designs on-call management strategy for multiple team services: SLO-based alerting architecture, automated incident response workflows, and on-call sustainability practices (burnout prevention, fair rotation). Implements observability-driven incident detection. Conducts post-mortem reviews and drives systemic improvements. Defines MTTD/MTTR targets and tracks reliability metrics.
Game Server Developer	Required	Designs on-call management architecture for game server platform: live ops incident response automation, predictive alerting for player experience degradation, and cross-region incident coordination. Implements game-specific SLIs (match quality, connection stability). Conducts post-mortems focused on player impact analysis. Defines on-call sustainability practices for game operations teams.
Network Engineer		Expertly applies on-call management for network engineering to design complex systems. Optimizes existing solutions and prevents architectural mistakes. Conducts code reviews and trains colleagues on best practices.
Platform Engineer	Required	Designs incident management system for IDP: automated incident creation, war room automation, stakeholder communication. Implements incident retrospective process with action item tracking. Creates self-healing automation for typical platform incidents (node failures, OOM).
Security Analyst	Required	Designs on-call management for security operations center: advanced threat detection automation, SOAR integration for incident response orchestration, and cross-functional security incident coordination. Implements distributed tracing for security event correlation. Defines security-specific SLIs (detection time, containment time). Conducts security post-mortems and drives detection engineering improvements.
Site Reliability Engineer (SRE)	Required	Optimizes on-call: alert tuning for noise reduction, automated remediation for common issues. Designs runbook automation. Analyzes on-call metrics and creates improvement plan.
Technical Lead	Required	Designs an observability strategy that incorporates on-call management. Implements distributed tracing. Defines SLIs/SLOs. Conducts post-mortems.

Lead / Staff

Role	Requirement	Description
Backend Developer (Go)		Defines on-call standards for the Go team: rotation schedules, escalation policies, runbook requirements. Conducts incident post-mortems, improves alert quality and reduces alert fatigue.
Backend Developer (Java/Kotlin)		Defines on-call standards for Java service platform: rotation policies aligned with team capacity, SLA-based escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Java service failures. Drives adoption of observability-driven incident management.
Backend Developer (Python)		Defines on-call standards for Python service platform: rotation policies, SLA-driven escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Python service failures (memory leaks, GIL contention, dependency issues). Drives adoption of reliability engineering practices.
Cloud Engineer	Required	Defines on-call strategy for the cloud organization: follow-the-sun rotation, tier-1/tier-2 escalation, incident commander role. Introduces incident management process (ITIL/SRE), blameless postmortems, reliability metrics (MTTA, MTTR). Manages on-call load balancing and burnout prevention.
Database Engineer / DBA	Required	Defines on-call standards for the database tier: rotation schedule, coverage requirements, alert fatigue reduction. Coordinates cross-team incident response. Conducts on-call retrospectives and improves processes.
DevOps Engineer	Required	Defines organizational incident management strategy: severity level standards, escalation matrices, communication protocols. Designs blameless post-mortem process, MTTR/MTTA metrics, SRE on-call program with sustainable rotation.
DevSecOps Engineer	Required	Defines incident management strategy for the security organization. Manages SOC team with 24/7 coverage. Builds metrics: MTTA, MTTD, MTTR, false positive rate. Introduces post-incident review processes with actionable improvements. Coordinates with legal, PR, management during major incidents.
Engineering Manager	Required	Defines on-call management strategy for product organization: SLO-based approach to incident management, on-call sustainability metrics, and cross-team incident coordination processes. Establishes post-mortem culture and tracks systemic improvements. Optimizes MTTD/MTTR across services.
Game Server Developer	Required	Defines on-call management strategy for game operations: live game incident management framework, player impact SLA definitions, and cross-studio incident coordination. Establishes post-mortem culture focused on player experience improvements. Drives adoption of proactive monitoring and automated remediation.
Network Engineer		Establishes on-call management standards for the network engineering team and makes architectural decisions. Defines the technical roadmap incorporating this skill. Mentors senior engineers and influences practices of adjacent teams.
Platform Engineer	Required	Defines organizational incident management strategy: on-call expectations, compensation, burnout prevention. Leads MTTR improvement through automation and tooling. Designs cross-team incident coordination for complex incidents. Creates incident readiness program.
Security Analyst	Required	Defines on-call management strategy for security operations. Establishes SOC shift management, threat response SLA targets, and security incident escalation framework. Coordinates cross-team security incident response. Optimizes MTTD/MTTR for security events across the organization.
Site Reliability Engineer (SRE)	Required	Defines on-call standards: rotation policies, compensation, workload balance. Implements on-call metrics (interruptions, sleep impact). Builds sustainable on-call culture.
Technical Lead	Required	Defines on-call management strategy for the product. Establishes SLO-based alerting approach, incident management framework, and post-mortem process. Coordinates cross-team incident response. Optimizes MTTD/MTTR through observability improvements and automated remediation.

Principal

Role	Requirement	Description
Backend Developer (Go)		Shapes on-call strategy: platform incident management, automated response, governance.
Backend Developer (Java/Kotlin)		Shapes incident management strategy for Java platform organization: platform-level on-call automation, automated incident response governance, and reliability culture. Drives adoption of SRE practices across Java service teams. Establishes enterprise-wide incident management standards.
Backend Developer (Python)		Shapes incident management strategy for Python platform organization: platform-level on-call automation, automated incident response patterns, and reliability culture. Drives adoption of SRE practices across Python service teams. Establishes enterprise-wide incident management and observability standards.
Cloud Engineer	Required	Shapes enterprise-level incident management framework: unified incident process for multi-cloud, automated incident response (AWS Systems Manager, PagerDuty Rundeck), AIOps for anomaly detection. Defines organizational resilience strategy and chaos engineering program.
Database Engineer / DBA	Required	Shapes incident management strategy for the data platform: automated incident response, AI-assisted diagnostics, cross-database impact analysis. Defines on-call sustainability and investments in automation for database operations.
DevOps Engineer	Required	Develops corporate incident management culture: SRE principles, toil budgets, 80% incident automation. Defines AIOps platform architecture: ML-powered alert correlation, automated remediation, predictive incident prevention.
DevSecOps Engineer	Required	Architects the enterprise Incident Response and Cyber Resilience program. Defines SOC strategy: automation level, staffing model, tooling. Develops Business Continuity Plan considering cyber threats. Builds IR maturity metrics for board-level reporting. Influences security budget.
Engineering Manager	Required	Defines organizational observability strategy. Implements platform solutions. Builds reliability culture. Establishes enterprise SLO framework.
Game Server Developer	Required	Defines organizational reliability strategy for game infrastructure: enterprise SLO framework for live games, platform observability and incident management solutions, and reliability culture across game studios. Drives adoption of SRE practices for game operations at scale.
Network Engineer		Shapes on-call management strategy for network engineering at the organizational level. Defines best practices and influences technology choices beyond their own team. Is a recognized expert in this area.
Platform Engineer	Required	Shapes operational excellence culture: blameless postmortems, learning from incidents, reliability as feature. Defines AIOps strategy for automated incident response. Advises executives on investment in on-call tooling and reliability engineering for a sustainable platform.
Security Analyst	Required	Defines organizational strategy for security operations and incident management. Implements platform SOC solutions with AI-driven threat detection and automated response. Builds security reliability culture across the organization. Establishes enterprise SLO framework for security event handling.
Site Reliability Engineer (SRE)	Required	Designs organizational on-call model: follow-the-sun, tiered support, shared on-call between SRE and dev teams. Defines on-call governance and toil elimination strategy.
Technical Lead	Required	Defines the organization's observability strategy. Implements platform solutions. Shapes reliability culture. Defines enterprise SLO framework.

Junior: 12 requirements

Backend Developer (Go)

Understands on-call for Go services: responds to alerts, uses runbooks. Participates in incident response.
Backend Developer (Java/Kotlin)

Understands on-call for Java services: responds to alerts, captures thread dumps and heap dumps. Uses runbooks for troubleshooting.
Backend Developer (Python)

Understands on-call for Python: responds to alerts, diagnoses exceptions. Uses runbooks.

Cloud Engineer

Understands on-call basics for cloud infrastructure: alert triage procedures, runbook following for common incidents, and escalation paths. Participates in on-call rotations as secondary responder. Follows team practices for incident documentation and handoff procedures.
Database Engineer / DBA

Participates in on-call rotation for the database tier: follows runbooks for alerts (high CPU, disk space, replication lag), escalates complex issues. Documents incidents and performs basic remediation actions.
DevOps Engineer

Understands on-call principles: duty schedules, escalation, incident management. Participates in on-call under senior engineer guidance, responds to alerts following runbooks. Knows tools: PagerDuty, Opsgenie, VictorOps.
DevSecOps Engineer

Participates in on-call rotation: responds to PagerDuty/OpsGenie alerts, follows runbooks for typical incidents. Documents actions and results. Understands escalation procedures. Studies basic security incidents: compromised credentials, suspicious login, certificate expiration. Maintains incident log.
Game Server Developer

Understands on-call basics for game server infrastructure: player-impacting alert triage, game service health monitoring, and escalation procedures for live game issues. Participates in on-call rotations for game server operations. Follows team runbooks for common game server incidents.
Network Engineer

Knows basic on-call management concepts for network engineering and can apply them in typical tasks. Uses standard tools and follows established team practices. Understands when and why this approach is used.
Platform Engineer

Participates in on-call rotation for platform services: follows runbooks, escalates by procedure. Uses PagerDuty/OpsGenie for incident management. Documents incidents and actions in timeline. Understands severity levels and response time SLA for the platform.
Security Analyst

Understands on-call basics for security operations: security alert triage procedures, SIEM dashboard monitoring, and security incident escalation paths. Participates in SOC rotations as junior analyst. Follows team procedures for security event investigation and documentation.
Site Reliability Engineer (SRE)

Participates in on-call: follows escalation procedures, uses PagerDuty for alert management. Documents incidents. Hands off duty with handoff notes.

Middle: 14 requirements

Backend Developer (Go)

Manages on-call: creates runbooks for Go services, configures alerting, post-incident review.
Backend Developer (Java/Kotlin)

Manages on-call: creates runbooks for JVM issues, configures alerting, GC troubleshooting guides.
Backend Developer (Python)

Manages on-call: runbooks for Python services, alerting configuration, incident response.

Cloud Engineer

Participates in on-call rotation for cloud infrastructure. Configures PagerDuty/OpsGenie with escalation, responds to CloudWatch and Prometheus alerts. Classifies incidents by severity, performs initial diagnostics — checking metrics, logs, cloud provider status pages.
Database Engineer / DBA

Handles database incidents independently: diagnosing deadlocks, query performance degradation, replication breaks. Writes and updates runbooks. Conducts post-incident reviews for database-related incidents.
DevOps Engineer

Configures on-call processes: schedules in PagerDuty/Opsgenie, escalation policies, alert routing rules. Creates runbooks for typical incidents, automates initial diagnostics. Conducts post-mortems and tracks action items.
DevSecOps Engineer

Configures on-call processes for security incidents: escalation policies, severity classification, notification channels. Creates runbooks for security on-call: credential compromise, DDoS, data breach, ransomware. Integrates PagerDuty with SIEM alerts. Conducts weekly on-call review and trend analysis.
Engineering Manager

Configures on-call management for team services: alert routing policies, rotation schedules with fair load distribution, and escalation chains. Creates dashboards for on-call health metrics (alert volume, MTTA, pages per rotation). Participates in incident response and conducts post-incident reviews.
Game Server Developer

Configures on-call management for game servers: player impact-based alert prioritization, game-specific health dashboards (CCU, match completion rate, latency percentiles), and automated remediation for common game service issues. Analyzes incident patterns to reduce toil. Participates in live ops on-call rotations.
Network Engineer

Confidently applies on-call management for network engineering in non-standard tasks. Independently selects the optimal approach and tools. Analyzes trade-offs and proposes improvements to existing solutions.
Platform Engineer

Configures on-call infrastructure for the platform: PagerDuty service dependencies, escalation policies, schedule overrides. Creates and updates runbooks for common incidents. Implements automated diagnostics: auto-remediation scripts, diagnostic dashboards. Conducts incident reviews.
Security Analyst

Configures on-call management for security operations: security alert correlation rules, threat severity-based triage automation, and SOC shift handoff procedures. Creates security monitoring dashboards with threat intelligence integration. Analyzes security incident patterns for detection rule improvement.
Site Reliability Engineer (SRE)

Manages on-call process: configures PagerDuty schedules and escalation policies, writes runbooks for typical alerts. Analyzes on-call burden: toil, alert quality, false positive rate.
Technical Lead

Configures on-call management for product services: service-level alerting aligned with SLOs, PagerDuty/OpsGenie integration, and on-call rotation optimization. Creates runbooks for common incidents and automates remediation where possible. Mentors team on incident response practices and blameless post-mortems.
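
Several Middle-level cards mention on-call health metrics such as MTTA and pages per rotation. The sketch below shows one way those numbers could be derived from raw page records. The record layout, the responder names, and the function names are all hypothetical; a real setup would pull this data from the paging tool rather than a hard-coded list.

```python
from datetime import datetime
from statistics import mean

# Hypothetical page records: (responder, triggered_at, acknowledged_at).
pages = [
    ("alice", datetime(2026, 3, 2, 1, 0), datetime(2026, 3, 2, 1, 4)),
    ("alice", datetime(2026, 3, 3, 14, 30), datetime(2026, 3, 3, 14, 32)),
    ("bob", datetime(2026, 3, 9, 23, 15), datetime(2026, 3, 9, 23, 24)),
]


def mtta_minutes(records):
    """Mean time to acknowledge, in minutes, across all page records."""
    return mean((ack - trig).total_seconds() / 60 for _, trig, ack in records)


def pages_per_responder(records):
    """Count how many pages each responder received."""
    counts = {}
    for responder, _, _ in records:
        counts[responder] = counts.get(responder, 0) + 1
    return counts


print(mtta_minutes(pages))         # acknowledgement delays are 4, 2 and 9 minutes
print(pages_per_responder(pages))  # per-person load, useful for fairness checks
```

Comparing the per-responder counts across rotations is a simple way to spot uneven load before it turns into burnout.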

Senior: 14 requirements

Backend Developer (Go)

Designs on-call practices: escalation policies, automated remediation, incident response playbooks.
Backend Developer (Java/Kotlin)

Designs on-call: automated JVM diagnostics, escalation policies, incident playbooks.
Backend Developer (Python)

Designs on-call: automated diagnostics, escalation policies, incident playbooks.

Cloud Engineer
Required

Designs on-call processes for the cloud team: alert routing by service and severity, runbooks for common incidents (disk full, OOM, AZ failure), post-incident review process. Optimizes alerting to reduce noise: deduplication, suppression rules, actionable alerts. Reduces toil through automation.
Database Engineer / DBA
Required

Designs on-call processes for the DBA team: alert routing by severity, escalation policies, runbook automation. Mentors junior DBAs in incident response. Implements automated remediation for common database issues.
DevOps Engineer
Required

Designs incident management process: automated incident classification, PagerDuty integration with Slack/Jira/StatusPage. Implements incident commander role, automates communication through ChatOps. Configures SLO-based alerting to reduce alert fatigue.
DevSecOps Engineer
Required

Develops corporate security Incident Management process: Incident Commander role, communication templates, stakeholder notification. Introduces automated triage through PagerDuty Event Intelligence. Creates tiered response: L1 (SOC), L2 (Security Engineering), L3 (Principal). Conducts GameDay exercises.
Engineering Manager
Required

Designs on-call management strategy for multiple team services: SLO-based alerting architecture, automated incident response workflows, and on-call sustainability practices (burnout prevention, fair rotation). Implements observability-driven incident detection. Conducts post-mortem reviews and drives systemic improvements. Defines MTTD/MTTR targets and tracks reliability metrics.
Game Server Developer
Required

Designs on-call management architecture for game server platform: live ops incident response automation, predictive alerting for player experience degradation, and cross-region incident coordination. Implements game-specific SLIs (match quality, connection stability). Conducts post-mortems focused on player impact analysis. Defines on-call sustainability practices for game operations teams.
Network Engineer

Expertly applies on-call management for network engineering to design complex systems. Optimizes existing solutions and prevents architectural mistakes. Conducts code reviews and trains colleagues on best practices.
Platform Engineer
Required

Designs incident management system for IDP: automated incident creation, war room automation, stakeholder communication. Implements incident retrospective process with action item tracking. Creates self-healing automation for typical platform incidents (node failures, OOM).
Security Analyst
Required

Designs on-call management for security operations center: advanced threat detection automation, SOAR integration for incident response orchestration, and cross-functional security incident coordination. Implements distributed tracing for security event correlation. Defines security-specific SLIs (detection time, containment time). Conducts security post-mortems and drives detection engineering improvements.
Site Reliability Engineer (SRE)
Required

Optimizes on-call: alert tuning for noise reduction, automated remediation for common issues. Designs runbook automation. Analyzes on-call metrics and creates improvement plan.
Technical Lead
Required

Designs an observability strategy that incorporates on-call management. Implements distributed tracing. Defines SLIs/SLOs. Conducts post-mortems.
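
The Senior cards refer to SLO-based alerting as a way to reduce alert fatigue. One common form of it is multi-window burn-rate alerting, sketched below under stated assumptions: the 99.9% target and the function names are illustrative, and the 14.4 threshold corresponds to spending 2% of a 30-day error budget within one hour. A page fires only when both a long and a short window exceed the threshold, so brief blips and already-recovered incidents do not wake anyone.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Speed at which the error budget is spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn the budget faster than the threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# 2% of requests failing over both the long and the short window: page.
print(should_page(0.02, 0.02))
# The long window is still elevated but the short window has recovered: no page.
print(should_page(0.02, 0.0005))
```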

Lead / Staff: 14 requirements

Backend Developer (Go)

Defines on-call standards for the Go team: rotation schedules, escalation policies, runbook requirements. Conducts incident post-mortems, improves alert quality and reduces alert fatigue.
Backend Developer (Java/Kotlin)

Defines on-call standards for Java service platform: rotation policies aligned with team capacity, SLA-based escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Java service failures. Drives adoption of observability-driven incident management.
Backend Developer (Python)

Defines on-call standards for Python service platform: rotation policies, SLA-driven escalation procedures, and post-mortem process requirements. Establishes runbook quality standards and automated remediation patterns for common Python service failures (memory leaks, GIL contention, dependency issues). Drives adoption of reliability engineering practices.

Cloud Engineer
Required

Defines on-call strategy for the cloud organization: follow-the-sun rotation, tier-1/tier-2 escalation, incident commander role. Introduces incident management process (ITIL/SRE), blameless postmortems, reliability metrics (MTTA, MTTR). Manages on-call load balancing and burnout prevention.
Database Engineer / DBA
Required

Defines on-call standards for the database tier: rotation schedule, coverage requirements, alert fatigue reduction. Coordinates cross-team incident response. Conducts on-call retrospectives and improves processes.
DevOps Engineer
Required

Defines organizational incident management strategy: severity level standards, escalation matrices, communication protocols. Designs blameless post-mortem process, MTTR/MTTA metrics, SRE on-call program with sustainable rotation.
DevSecOps Engineer
Required

Defines incident management strategy for the security organization. Manages SOC team with 24/7 coverage. Builds metrics: MTTA, MTTD, MTTR, false positive rate. Introduces post-incident review processes with actionable improvements. Coordinates with legal, PR, management during major incidents.
Engineering Manager
Required

Defines on-call management strategy for product organization: SLO-based approach to incident management, on-call sustainability metrics, and cross-team incident coordination processes. Establishes post-mortem culture and tracks systemic improvements. Optimizes MTTD/MTTR across services.
Game Server Developer
Required

Defines on-call management strategy for game operations: live game incident management framework, player impact SLA definitions, and cross-studio incident coordination. Establishes post-mortem culture focused on player experience improvements. Drives adoption of proactive monitoring and automated remediation.
Network Engineer

Establishes on-call management standards for the network engineering team and makes architectural decisions. Defines the technical roadmap incorporating this skill. Mentors senior engineers and influences practices of adjacent teams.
Platform Engineer
Required

Defines organizational incident management strategy: on-call expectations, compensation, burnout prevention. Leads MTTR improvement through automation and tooling. Designs cross-team incident coordination for complex incidents. Creates incident readiness program.
Security Analyst
Required

Defines on-call management strategy for security operations. Establishes SOC shift management, threat response SLA targets, and security incident escalation framework. Coordinates cross-team security incident response. Optimizes MTTD/MTTR for security events across the organization.
Site Reliability Engineer (SRE)
Required

Defines on-call standards: rotation policies, compensation, workload balance. Implements on-call metrics (interruptions, sleep impact). Builds sustainable on-call culture.
Technical Lead
Required

Defines on-call management strategy for the product. Establishes SLO-based alerting approach, incident management framework, and post-mortem process. Coordinates cross-team incident response. Optimizes MTTD/MTTR through observability improvements and automated remediation.
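
The Lead / Staff cards talk about escalation policies and escalation matrices. As a rough illustration, an escalation policy can be modelled as an ordered list of steps, each firing after a page has gone unacknowledged for some time. The step timings, the targets, and the names `EscalationStep`, `POLICY`, and `notified_so_far` are assumptions made for this sketch, not the data model of any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    notify: str         # who gets notified at this step


# Hypothetical three-tier policy.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
]


def notified_so_far(policy, minutes_unacknowledged):
    """Return everyone who should have been notified by now, in order."""
    return [step.notify for step in policy
            if step.after_minutes <= minutes_unacknowledged]


print(notified_so_far(POLICY, 20))  # primary and secondary have been paged
```

Writing the policy down as data makes it easy to review the timings per severity level and to test that no step is unreachable.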

Principal: 14 requirements

Backend Developer (Go)

Shapes on-call strategy: platform incident management, automated response, governance.
Backend Developer (Java/Kotlin)

Shapes incident management strategy for Java platform organization: platform-level on-call automation, automated incident response governance, and reliability culture. Drives adoption of SRE practices across Java service teams. Establishes enterprise-wide incident management standards.
Backend Developer (Python)

Shapes incident management strategy for Python platform organization: platform-level on-call automation, automated incident response patterns, and reliability culture. Drives adoption of SRE practices across Python service teams. Establishes enterprise-wide incident management and observability standards.

Cloud Engineer
Required

Shapes enterprise-level incident management framework: unified incident process for multi-cloud, automated incident response (AWS Systems Manager, PagerDuty Rundeck), AIOps for anomaly detection. Defines organizational resilience strategy and chaos engineering program.
Database Engineer / DBA
Required

Shapes incident management strategy for the data platform: automated incident response, AI-assisted diagnostics, cross-database impact analysis. Defines on-call sustainability and investments in automation for database operations.
DevOps Engineer
Required

Develops corporate incident management culture: SRE principles, toil budgets, 80% incident automation. Defines AIOps platform architecture: ML-powered alert correlation, automated remediation, predictive incident prevention.
DevSecOps Engineer
Required

Architects the enterprise Incident Response and Cyber Resilience program. Defines SOC strategy: automation level, staffing model, tooling. Develops Business Continuity Plan considering cyber threats. Builds IR maturity metrics for board-level reporting. Influences security budget.
Engineering Manager
Required

Defines organizational observability strategy. Implements platform solutions. Builds reliability culture. Establishes enterprise SLO framework.
Game Server Developer
Required

Defines organizational reliability strategy for game infrastructure: enterprise SLO framework for live games, platform observability and incident management solutions, and reliability culture across game studios. Drives adoption of SRE practices for game operations at scale.
Network Engineer

Shapes on-call management strategy for network engineering at the organizational level. Defines best practices and influences technology choices beyond their own team. Is a recognized expert in this area.
Platform Engineer
Required

Shapes operational excellence culture: blameless postmortems, learning from incidents, reliability as feature. Defines AIOps strategy for automated incident response. Advises executives on investment in on-call tooling and reliability engineering for a sustainable platform.
Security Analyst
Required

Defines organizational strategy for security operations and incident management. Implements platform SOC solutions with AI-driven threat detection and automated response. Builds security reliability culture across the organization. Establishes enterprise SLO framework for security event handling.
Site Reliability Engineer (SRE)
Required

Designs organizational on-call model: follow-the-sun, tiered support, shared on-call between SRE and dev teams. Defines on-call governance and toil elimination strategy.
Technical Lead
Required

Defines the organization's observability strategy. Implements platform solutions. Shapes reliability culture. Defines enterprise SLO framework.
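
The follow-the-sun model named in the SRE and Cloud Engineer cards hands on-call duty between regions so that each one covers its own daytime. A minimal sketch, assuming three hypothetical regions that each take an 8-hour block of the UTC day (the region names and boundaries are illustrative only):

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start_hour_utc, end_hour_utc, region).
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def region_on_call(now: datetime) -> str:
    """Return the region responsible for pages at the given moment."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"shift table does not cover hour {hour}")


print(region_on_call(datetime(2026, 3, 17, 12, 0, tzinfo=timezone.utc)))
```

Real schedules also need overrides, holidays, and a handoff window between regions, which this sketch leaves out.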
