Domain
Data Engineering
Skill Profile
PySpark, Spark SQL, DataFrames, partitioning, optimization, Spark on K8s
Number of roles
3
Roles with this skill
Number of levels
5
Structured growth path
Required
13
Remaining 2 optional
Data Engineering
Batch Processing
2026/3/17
The tables below show how skill depth changes from junior to principal, and what must be mastered to advance to each level.
Level 1 (Junior)

| Role | Requirement | Description |
|---|---|---|
| Data Engineer | Required | Understands Apache Spark fundamentals for data engineering: RDD/DataFrame APIs, basic transformations and actions, reading/writing Parquet/CSV/JSON. Follows team patterns for PySpark job structure, SparkSession configuration, and cluster resource allocation. |
| Data Scientist | Optional | Understands Apache Spark fundamentals for data science: Spark DataFrames for large-scale data analysis, basic Spark SQL queries, and MLlib for distributed model training. Follows team patterns for notebook-based Spark workflows and feature engineering at scale. |
| ML Engineer | Required | Understands Apache Spark fundamentals for ML engineering: Spark MLlib pipelines, feature transformers, and distributed model training/inference. Follows team patterns for PySpark ML workflows, model serialization, and integration with MLflow tracking. |
Level 2

| Role | Requirement | Description |
|---|---|---|
| Data Engineer | Required | Independently implements Spark data pipelines: optimizes shuffle operations and partitioning strategies, implements Structured Streaming for real-time ETL, manages Delta Lake tables with ACID transactions. Tunes Spark configurations for memory, parallelism, and cost efficiency. |
| Data Scientist | Optional | Independently uses Spark for large-scale analysis: writes optimized Spark SQL for complex aggregations, implements distributed feature engineering with window functions, and uses MLlib for hyperparameter tuning at scale. Manages Spark resource allocation for interactive analytics. |
| ML Engineer | Required | Uses PySpark for large-scale feature engineering. Optimizes Spark jobs (partitioning, caching, broadcast joins). Uses Spark ML for distributed model training. |
Level 3

| Role | Requirement | Description |
|---|---|---|
| Data Engineer | Required | Designs Spark-based data platform architecture: multi-tenant cluster management, cost-optimized workload scheduling with YARN/Kubernetes, and lakehouse architecture with Delta Lake/Iceberg. Implements data quality frameworks, CDC pipelines, and Spark application performance monitoring. |
| Data Scientist | Required | Designs Spark-based analytical frameworks: custom MLlib transformers for domain-specific features, distributed experiment pipelines, and Spark integration with GPU-accelerated training (RAPIDS). Optimizes end-to-end ML workflows from data preparation to model serving at petabyte scale. |
| ML Engineer | Required | Designs Spark-based ML pipelines for production. Optimizes Spark for ML workloads: memory tuning, shuffle optimization. Integrates Spark with ML platform (MLflow, feature store). |
Level 4

| Role | Requirement | Description |
|---|---|---|
| Data Engineer | Required | Defines Spark standards: coding guidelines, job submission patterns, resource allocation policies. Chooses between PySpark and Spark SQL by scenario. Implements unit testing for Spark jobs with chispa. |
| Data Scientist | Required | Defines data engineering strategy. Shapes the data platform. Coordinates data teams. Optimizes data mesh/data fabric approaches. |
| ML Engineer | Required | Defines Spark strategy for ML data processing. Evaluates Spark vs alternatives (Dask, Ray) for ML workloads. Designs distributed computing architecture for ML. |
Level 5 (Principal)

| Role | Requirement | Description |
|---|---|---|
| Data Engineer | Required | Designs platform-wide Spark strategy: EMR vs Databricks vs self-hosted, cluster sizing, dynamic allocation. Defines when to use Spark vs DuckDB vs Polars. Plans migration to Spark 4.0. |
| Data Scientist | Required | Defines organizational data strategy. Designs the enterprise data platform. Establishes the data governance framework. |
| ML Engineer | Required | Defines distributed processing strategy for enterprise ML. Designs the data processing layer for the ML platform. Evaluates novel distributed frameworks. |