Site Reliability Engineer (SRE)
Ensuring reliability, scalability, and performance of production systems
Site Reliability Engineer (SRE) is a role in the DevOps & SRE family. It covers 61 skills across 5 levels (Junior to Principal), with 139 skill-level requirements marked mandatory. Key domains: Programming Fundamentals, Backend Development, Database Management.
Technology Stack
Focus by Level
Junior: Monitoring SLI/SLO. Participating in on-call rotation. Writing runbooks. Automating routine operations. Incident analysis.
Mid: Defining SLI/SLO/SLA. Designing monitoring. Capacity planning. Automating incident response. Post-mortem analysis.
Senior: Designing highly available systems. Chaos engineering. Performance engineering. Error budgets. Coordination with development.
Lead: SRE strategy. Reliability culture. SLO standards. Incident management processes. Coordination with product.
Principal: Enterprise reliability strategy. Multi-region architecture. SRE culture at scale. Industry best practices.
Skill Matrix
61 skills × 5 levels.
AI-Assisted Development
4 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| GitHub Copilot | A | W | A | E | E |
| Cursor IDE | A | W | A | E | E |
| ChatGPT / Claude | A | W | A | E | E |
| Prompt Engineering for Code | A | W | A | E | E |
API & Integration
3 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| REST API Design | A | W | A | E | E |
| GraphQL Design | A | W | A | E | E |
| API Documentation | A | W | A | E | E |
Architecture & System Design
4 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| System Design Fundamentals | A | W | A | E | E |
| High Load Architecture | A | W | A | E | E |
| Capacity Planning | A | W | A | E | E |
| Disaster Recovery Design | A | W | A | E | E |
Backend Development
3 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Python Web Frameworks | A | W | A | E | E |
| Apache Kafka | A | W | A | E | E |
| Redis | A | W | A | E | E |
Cloud & Infrastructure
9 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Docker | A | W | A | E | E |
| Kubernetes Core | A | W | A | E | E |
| Kubernetes Advanced | A | W | A | E | E |
| Helm | A | W | A | E | E |
| Terraform | A | W | A | E | E |
| AWS | A | W | A | E | E |
| Network Fundamentals | A | W | A | E | E |
| Load Balancing | A | W | A | E | E |
| VPN & Network Isolation | A | W | A | E | E |
Database Management
3 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| PostgreSQL | A | W | A | E | E |
| Database Indexing | A | W | A | E | E |
| Query Optimization | A | W | A | E | E |
DevOps & CI/CD
3 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| GitHub Actions / GitLab CI | A | W | A | E | E |
| GitOps Practices | A | W | A | E | E |
| ArgoCD | A | W | A | E | E |
Observability & Monitoring
11 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Structured Logging | A | W | A | E | E |
| ELK Stack | A | W | A | E | E |
| Grafana Loki | A | W | A | E | E |
| Prometheus & Grafana | A | W | A | E | E |
| Custom Business Metrics | A | W | A | E | E |
| OpenTelemetry | A | W | A | E | E |
| Jaeger / Grafana Tempo | A | W | A | E | E |
| Continuous Profiling | A | W | A | E | E |
| APM Tools | A | W | A | E | E |
| SLI / SLO / SLA | A | W | A | E | E |
| On-Call Management | A | W | A | E | E |
Performance Engineering
1 skill

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Latency Optimization | A | W | A | E | E |
Programming Fundamentals
9 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Algorithms & Complexity | A | W | A | E | E |
| Data Structures | A | W | A | E | E |
| OOP & SOLID Principles | A | W | A | E | E |
| Design Patterns | A | W | A | E | E |
| Multithreading | A | W | A | E | E |
| Async Programming | A | W | A | E | E |
| Code Quality & Refactoring | A | W | A | E | E |
| Type Safety & Type Systems | A | W | A | E | E |
| Memory Management | A | W | A | E | E |
Security
5 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| OWASP & Application Security | A | W | A | E | E |
| Secure Coding Practices | A | W | A | E | E |
| Secrets Management | A | W | A | E | E |
| JWT / OAuth2 / OIDC | A | W | A | E | E |
| Incident Response Process | A | W | A | E | E |
Testing & QA
4 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Unit Testing | A | W | A | E | E |
| Integration Testing | A | W | A | E | E |
| E2E Testing | A | W | A | E | E |
| Chaos Engineering | A | W | A | E | E |
Version Control & Collaboration
2 skills

| Skills | Jun | Mid | Sen | Lead | Princ |
|---|---|---|---|---|---|
| Git Advanced | A | W | A | E | E |
| Code Review | A | W | A | E | E |
FAQ
What skills are needed for the Site Reliability Engineer (SRE) role?
The Site Reliability Engineer (SRE) role covers 61 skills across 5 levels, from Junior to Principal, with 139 skill-level requirements marked mandatory. See the skill matrix above for the full list.
How do I advance to the next level in the Site Reliability Engineer (SRE) role?
Use the Grade Calculator to assess your current level and get personalized recommendations. The system will show which skills need to be developed for the next level.
What tech stack is used in the Site Reliability Engineer (SRE) role?
The stack varies across the 5 levels. It includes Linux, Prometheus/Grafana, PagerDuty/OpsGenie, Bash/Python scripting, Docker, Kubernetes basics, Kubernetes, Prometheus/Thanos, Grafana/Loki, OpenTelemetry, Terraform, Go/Python, Chaos Monkey basics, Runbook automation, Kubernetes advanced, Chaos Engineering (Litmus/Gremlin), eBPF tools, OpenTelemetry advanced, Custom exporters, Load testing (k6/Gatling)...
How does the community define requirements for the Site Reliability Engineer (SRE) role?
Role requirements are shaped by the community through a proposal system. Any member can suggest changes that go through voting and expert review.