Domain: Data Engineering
Skill profile: PySpark, Spark SQL, DataFrames, partitioning, optimization, Spark on K8s
Roles: 3 (where this skill appears)
Levels: 5 (a structured development path)
Required expectations: 13 (the other 2 are optional)
Data Engineering
Batch Processing
17 March 2026
Select your current level and compare the expectations.
The tables show how the expected depth grows from Junior to Principal.
| Role | Required | Description |
|---|---|---|
| Data Engineer | Required | Understands Apache Spark fundamentals for data engineering: RDD/DataFrame APIs, basic transformations and actions, reading/writing Parquet/CSV/JSON. Follows team patterns for PySpark job structure, SparkSession configuration, and cluster resource allocation. |
| Data Scientist | Optional | Understands Apache Spark fundamentals for data science: Spark DataFrames for large-scale data analysis, basic Spark SQL queries, and MLlib for distributed model training. Follows team patterns for notebook-based Spark workflows and feature engineering at scale. |
| ML Engineer | Required | Understands Apache Spark fundamentals for ML engineering: Spark MLlib pipelines, feature transformers, and distributed model training/inference. Follows team patterns for PySpark ML workflows, model serialization, and integration with MLflow tracking. |
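
As an illustration of the "PySpark job structure" and "SparkSession configuration" expectations above, the following is a minimal sketch, not a prescribed team pattern. The app name `daily_events`, the columns `status` and `event_date`, and the input/output paths are hypothetical. PySpark is imported inside `run_job` so that the configuration helper can be exercised without a Spark installation.

```python
from typing import Dict


def build_spark_conf(app_name: str, shuffle_partitions: int = 200) -> Dict[str, str]:
    """Collect session settings in one place so every job is configured the same way."""
    return {
        "spark.app.name": app_name,
        # 200 is Spark's default for this setting; small jobs often lower it.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def run_job(input_path: str, output_path: str) -> None:
    """Read Parquet, aggregate, and write Parquet (hypothetical schema)."""
    # Lazy imports: the module can be loaded and unit-tested without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    builder = SparkSession.builder
    for key, value in build_spark_conf("daily_events").items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Transformations only build a plan; no rows are processed yet.
    events = spark.read.parquet(input_path)
    daily = (
        events
        .filter(F.col("status") == "ok")
        .groupBy("event_date")
        .agg(F.count("*").alias("events"))
    )

    # The write is the action that triggers execution of the plan.
    daily.write.mode("overwrite").parquet(output_path)
    spark.stop()
```

Keeping the settings in a plain dictionary makes them easy to review and to override per environment before the session is created.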
| Role | Required | Description |
|---|---|---|
| Data Engineer | Required | Independently implements Spark data pipelines: optimizes shuffle operations and partitioning strategies, implements Structured Streaming for real-time ETL, manages Delta Lake tables with ACID transactions. Tunes Spark configurations for memory, parallelism, and cost efficiency. |
| Data Scientist | Optional | Independently uses Spark for large-scale analysis: writes optimized Spark SQL for complex aggregations, implements distributed feature engineering with window functions, and uses MLlib for hyperparameter tuning at scale. Manages Spark resource allocation for interactive analytics. |
| ML Engineer | Required | Uses PySpark for large-scale feature engineering. Optimizes Spark jobs (partitioning, caching, broadcast joins). Uses Spark ML for distributed model training. |
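
The partitioning and broadcast-join expectations above can be made concrete with two small sizing helpers. This is a rough sketch under stated assumptions: the 128 MB target partition size is a common rule of thumb rather than a Spark default, and the function names are hypothetical. The 10 MB figure is Spark's documented default for `spark.sql.autoBroadcastJoinThreshold`.

```python
import math

MB = 1024 * 1024

# Spark's documented default for spark.sql.autoBroadcastJoinThreshold.
DEFAULT_BROADCAST_THRESHOLD = 10 * MB


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * MB,  # assumption: rule-of-thumb target
    min_partitions: int = 1,
) -> int:
    """Pick a partition count so each shuffle partition is near the target size."""
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


def should_broadcast(
    small_side_bytes: int,
    threshold: int = DEFAULT_BROADCAST_THRESHOLD,
) -> bool:
    """Return True when the smaller join side fits under the broadcast threshold.

    A threshold of -1 disables automatic broadcasting, as in Spark itself.
    """
    return 0 <= small_side_bytes <= threshold


if __name__ == "__main__":
    print(suggest_shuffle_partitions(10 * 1024 * MB))  # 10 GiB of shuffle data
    print(should_broadcast(5 * MB))
```

The suggested count would typically be applied through `spark.sql.shuffle.partitions` or an explicit `repartition`, and then checked against real task sizes in the Spark UI.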
| Role | Required | Description |
|---|---|---|
| Data Engineer | Required | Designs Spark-based data platform architecture: multi-tenant cluster management, cost-optimized workload scheduling with YARN/Kubernetes, and lakehouse architecture with Delta Lake/Iceberg. Implements data quality frameworks, CDC pipelines, and Spark application performance monitoring. |
| Data Scientist | Required | Designs Spark-based analytical frameworks: custom MLlib transformers for domain-specific features, distributed experiment pipelines, and Spark integration with GPU-accelerated training (RAPIDS). Optimizes end-to-end ML workflows from data preparation to model serving at petabyte scale. |
| ML Engineer | Required | Designs Spark-based ML pipelines for production. Optimizes Spark for ML workloads: memory tuning, shuffle optimization. Integrates Spark with the ML platform (MLflow, feature store). |
| Role | Required | Description |
|---|---|---|
| Data Engineer | Required | Defines Spark standards: coding guidelines, job submission patterns, resource allocation policies. Chooses between PySpark and Spark SQL by scenario. Implements unit testing for Spark jobs with chispa. |
| Data Scientist | Required | Defines data engineering strategy. Shapes the data platform. Coordinates data teams. Optimizes data mesh/data fabric approaches. |
| ML Engineer | Required | Defines Spark strategy for ML data processing. Evaluates Spark vs alternatives (Dask, Ray) for ML workloads. Designs distributed computing architecture for ML. |
| Role | Required | Description |
|---|---|---|
| Data Engineer | Required | Designs platform Spark strategy: EMR vs Databricks vs self-hosted, cluster sizing, dynamic allocation. Defines when to use Spark versus DuckDB or Polars. Plans migration to Spark 4.0. |
| Data Scientist | Required | Defines organizational data strategy. Designs enterprise data platform. Establishes data governance framework. |
| ML Engineer | Required | Defines distributed processing strategy for enterprise ML. Designs data processing layer for ML platform. Evaluates novel distributed frameworks. |
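
For the "cluster sizing, dynamic allocation" item above, a platform policy might be encoded as a small helper that emits the relevant Spark settings. This is a sketch only: the helper names and the executor bounds are hypothetical, while the configuration keys are standard Spark properties. Shuffle tracking is enabled for the Kubernetes case on the assumption that no external shuffle service is available there.

```python
from typing import Dict, List


def dynamic_allocation_conf(
    min_executors: int,
    max_executors: int,
    on_kubernetes: bool = True,
) -> Dict[str, str]:
    """Build dynamic allocation settings with explicit executor bounds."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")

    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    if on_kubernetes:
        # Assumption: no external shuffle service, so Spark tracks shuffle
        # files itself to decide which executors can be released.
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as spark-submit arguments in a stable order."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(dynamic_allocation_conf(2, 20))))
```

Centralizing the bounds in one function gives a single place to enforce per-team limits before jobs reach the cluster.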