Site Reliability Engineer (SRE)
Ensuring reliability, scalability, and performance of production systems
Site Reliability Engineer (SRE) is a role in the DevOps & SRE family. It covers 61 skills across 5 levels (Junior to Principal); of these, 139 are listed as required. Key areas: Programming Fundamentals, Backend Development, Database Management.
Tech stack
Focus by level
Junior: Monitoring SLI/SLO. Participating in on-call rotation. Writing runbooks. Automating routine operations. Incident analysis.
Mid: Defining SLI/SLO/SLA. Designing monitoring. Capacity planning. Automating incident response. Post-mortem analysis.
Senior: Designing highly available systems. Chaos engineering. Performance engineering. Error budgets. Coordination with development.
Lead: SRE strategy. Reliability culture. SLO standards. Incident management processes. Coordination with product.
Principal: Enterprise reliability strategy. Multi-region architecture. SRE culture at scale. Industry best practices.
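The SLI/SLO and error-budget items above come down to simple arithmetic, which a short sketch can make concrete. The example below is a minimal illustration, not a method prescribed by this page: the function names (`allowed_failures`, `budget_remaining`) and the sample numbers are hypothetical, and it assumes a request-based SLI where each request either succeeds or fails.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    if total_requests <= 0:
        # No traffic observed, so no budget has been consumed.
        return 1.0
    budget = allowed_failures(slo_target, total_requests)
    return 1 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

A team might use a value like this to decide whether to keep shipping features or to prioritize reliability work once the remaining budget gets low; the threshold for that decision is a policy choice the sketch deliberately leaves out.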
Skill matrix
61 skills × 5 levels.
AI-Assisted Development
4 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| GitHub Copilot | A | W | A | E | E |
| Cursor IDE | A | W | A | E | E |
| ChatGPT / Claude | A | W | A | E | E |
| Prompt Engineering for Code | A | W | A | E | E |
API & Integration
3 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| REST API Design | A | W | A | E | E |
| GraphQL Design | A | W | A | E | E |
| API Documentation | A | W | A | E | E |
Architecture & System Design
4 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| System Design Fundamentals | A | W | A | E | E |
| High Load Architecture | A | W | A | E | E |
| Capacity Planning | A | W | A | E | E |
| Disaster Recovery Design | A | W | A | E | E |
Backend Development
3 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Python Web Frameworks | A | W | A | E | E |
| Apache Kafka | A | W | A | E | E |
| Redis | A | W | A | E | E |
Cloud & Infrastructure
9 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Docker | A | W | A | E | E |
| Kubernetes Core | A | W | A | E | E |
| Kubernetes Advanced | A | W | A | E | E |
| Helm | A | W | A | E | E |
| Terraform | A | W | A | E | E |
| AWS | A | W | A | E | E |
| Network Fundamentals | A | W | A | E | E |
| Load Balancing | A | W | A | E | E |
| VPN & Network Isolation | A | W | A | E | E |
Database Management
3 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| PostgreSQL | A | W | A | E | E |
| Database Indexing | A | W | A | E | E |
| Query Optimization | A | W | A | E | E |
DevOps & CI/CD
3 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| GitHub Actions / GitLab CI | A | W | A | E | E |
| GitOps Practices | A | W | A | E | E |
| ArgoCD | A | W | A | E | E |
Observability & Monitoring
11 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Structured Logging | A | W | A | E | E |
| ELK Stack | A | W | A | E | E |
| Grafana Loki | A | W | A | E | E |
| Prometheus & Grafana | A | W | A | E | E |
| Custom Business Metrics | A | W | A | E | E |
| OpenTelemetry | A | W | A | E | E |
| Jaeger / Grafana Tempo | A | W | A | E | E |
| Continuous Profiling | A | W | A | E | E |
| APM Tools | A | W | A | E | E |
| SLI / SLO / SLA | A | W | A | E | E |
| On-Call Management | A | W | A | E | E |
Performance Engineering
1 skill

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Latency Optimization | A | W | A | E | E |
Programming Fundamentals
9 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Algorithms & Complexity | A | W | A | E | E |
| Data Structures | A | W | A | E | E |
| OOP & SOLID Principles | A | W | A | E | E |
| Design Patterns | A | W | A | E | E |
| Multithreading | A | W | A | E | E |
| Async Programming | A | W | A | E | E |
| Code Quality & Refactoring | A | W | A | E | E |
| Type Safety & Type Systems | A | W | A | E | E |
| Memory Management | A | W | A | E | E |
Security
5 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| OWASP & Application Security | A | W | A | E | E |
| Secure Coding Practices | A | W | A | E | E |
| Secrets Management | A | W | A | E | E |
| JWT / OAuth2 / OIDC | A | W | A | E | E |
| Incident Response Process | A | W | A | E | E |
Testing & QA
4 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Unit Testing | A | W | A | E | E |
| Integration Testing | A | W | A | E | E |
| E2E Testing | A | W | A | E | E |
| Chaos Engineering | A | W | A | E | E |
Version Control & Collaboration
2 skills

| Skill | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Git Advanced | A | W | A | E | E |
| Code Review | A | W | A | E | E |
FAQ
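What skills does the Site Reliability Engineer (SRE) role require?
The Site Reliability Engineer (SRE) role requires 61 skills, 139 of which are listed as required. The skills are spread across 5 levels, from Junior to Principal.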
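How do I advance to the next level in the Site Reliability Engineer (SRE) role?
Use the level calculator to assess your current level and get personalized recommendations. It shows which skills you need to develop for promotion.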
What tech stack does the Site Reliability Engineer (SRE) role use?
The tech stack covers technologies for 5 different levels: Linux, Prometheus/Grafana, PagerDuty/OpsGenie, Bash/Python scripting, Docker, Kubernetes basics, Kubernetes, Prometheus/Thanos, Grafana/Loki, OpenTelemetry, Terraform, Go/Python, Chaos Monkey basics, Runbook automation, Kubernetes advanced, Chaos Engineering (Litmus/Gremlin), eBPF tools, OpenTelemetry advanced, Custom exporters, Load testing (k6/Gatling)...
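How does the community define the requirements for the Site Reliability Engineer (SRE) role?
Role requirements are set by the community through a proposal system. Any member can propose changes, which take effect after a vote and expert review.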