Domain: Data Engineering
Skill Profile: DataFrame API, performance, lazy evaluation, Arrow backend, tabular data processing
Roles: 9 (where this skill appears)
Levels: 5 (structured growth path)
Mandatory requirements: 39 (the other 6 optional)
Category: Data Engineering / Batch Processing
Date: 3/17/2026
Choose your current level and compare expectations. The items below show what to cover to advance to the next level.
The tables below show how skill depth grows from Junior to Principal.
Level 1 (Junior)

| Role | Required | Description |
|---|---|---|
| Analytics Engineer | Required | Uses pandas for simple data preparation tasks: reading CSV/Excel, basic filtering and aggregation for ad-hoc analytics. Understands DataFrame operations for exploring data before creating dbt models. |
| BI Analyst | Required | Understands Pandas basics for BI workflows: DataFrame creation from various sources (CSV, Excel, SQL), basic data filtering and aggregation, and pivot table operations. Cleans and prepares datasets for dashboard visualization. Follows team conventions for data transformation scripts. |
| Computer Vision Engineer | Optional | Understands Pandas basics for CV data pipelines: image metadata management in DataFrames, annotation dataset loading and manipulation, and dataset splitting for training/validation/test sets. Follows team practices for data preprocessing and augmentation pipeline input preparation. |
| Data Analyst | Required | Understands Pandas basics for analytical workflows: data loading from multiple sources, exploratory data analysis with describe/info/value_counts, and basic data visualization with matplotlib integration. Cleans datasets by handling missing values and type conversions. Follows team coding standards for Jupyter notebooks. |
| Data Engineer | Required | Processes data through pandas: read_csv/read_parquet, filtering, grouping, merge. Understands DataFrame API. Works with data types and missing values (fillna, dropna). |
| Data Scientist | Optional | Understands Pandas basics for data science workflows: feature extraction from raw datasets, statistical analysis with groupby/agg, and data preparation for scikit-learn pipelines. Handles categorical encoding, normalization, and train/test splitting. Follows team practices for reproducible data processing. |
| LLM Engineer | Optional | Understands Pandas basics for LLM data preparation: text dataset loading and preprocessing, training corpus statistics analysis, and prompt/completion pair management. Cleans and formats text data for fine-tuning datasets. Follows team practices for data versioning and quality checks. |
| ML Engineer | Required | Effectively uses pandas for ML: data loading, EDA, feature engineering. Knows main operations: groupby, merge, pivot. Understands dtypes for memory optimization. |
| NLP Engineer | Required | Knows pandas basics for working with text data: loading corpora, filtering, grouping, basic text preprocessing. Uses str accessor for text column operations. |
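Several of the Junior-level rows above name the same core operations: read_csv, filtering, aggregation with groupby, fillna for missing values, and the str accessor for text columns. A minimal runnable sketch of those basics; the CSV content is invented sample data:

```python
import io

import pandas as pd

# Invented sample data standing in for a real CSV file.
csv_data = io.StringIO(
    "city,product,revenue\n"
    "Berlin,Widget,100\n"
    "Berlin,Gadget,\n"
    "Munich,Widget,250\n"
)

df = pd.read_csv(csv_data)                      # loading
df["revenue"] = df["revenue"].fillna(0)         # handle missing values
big = df[df["revenue"] > 50]                    # filtering
by_city = df.groupby("city")["revenue"].sum()   # aggregation
df["city_upper"] = df["city"].str.upper()       # str accessor on a text column

print(by_city)
```

The same pattern extends to the other sources mentioned above (read_parquet, read_excel, read_sql).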
Level 2

| Role | Required | Description |
|---|---|---|
| Analytics Engineer | Required | Applies pandas/polars for complex data preprocessing: merging heterogeneous sources, pivot tables, time series processing. Uses polars to accelerate local processing of large files before loading into the warehouse. |
| BI Analyst | Required | Implements efficient BI data pipelines with Pandas: multi-source data merging, complex aggregation chains, and time-series analysis for trend detection. Optimizes memory usage with proper dtype selection and chunked reading for large files. Creates reusable data transformation functions for recurring analytics tasks. |
| Computer Vision Engineer | Optional | Implements CV data management pipelines with Pandas: annotation format conversion (COCO, YOLO, VOC), dataset statistics computation for class imbalance analysis, and batch metadata processing for training monitoring. Integrates Pandas with image processing libraries for efficient data loading. Optimizes DataFrame operations for large annotation datasets. |
| Data Analyst | Required | Implements efficient analytical pipelines with Pandas: multi-table join strategies, window functions with rolling/expanding, and time-series resampling for different granularities. Uses Polars for performance-critical transformations on large datasets. Creates parameterized analysis pipelines with proper error handling and data validation. |
| Data Engineer | Required | Optimizes processing through pandas/Polars: chunked reading for large files, category dtype for memory, vectorized operations instead of iterrows. Migrates to Polars for performance-critical tasks. |
| Data Scientist | Optional | Implements efficient ML data pipelines with Pandas/Polars: feature engineering with complex transformations, automated feature selection based on statistical tests, and efficient data sampling strategies. Uses Polars for high-performance feature computation on large datasets. Creates reproducible feature pipelines with proper versioning. |
| LLM Engineer | Optional | Implements LLM data processing pipelines with Pandas/Polars: text corpus cleaning and deduplication at scale, training data quality metrics computation, and evaluation dataset management. Uses Polars for high-performance text preprocessing on large corpora. Creates reproducible data preparation workflows for fine-tuning and evaluation. |
| ML Engineer | Required | Optimizes pandas code for ML: vectorized operations, category dtype, chunked reading. Uses Polars for faster processing. Writes efficient feature engineering pipelines. |
| NLP Engineer | Required | Independently processes large text datasets via pandas/Polars. Optimizes memory usage for corpora, applies vectorized string operations, integrates with NLP libraries. |
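Two optimizations recur across these rows: the category dtype for low-cardinality columns and vectorized operations instead of iterrows. A sketch on invented data, measuring the memory effect directly:

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "country": np.random.choice(["DE", "FR", "PL"], size=n),  # low cardinality
    "amount": np.random.rand(n),
})

# category dtype stores small integer codes plus one lookup table,
# instead of one Python string object per row
object_bytes = df["country"].memory_usage(deep=True)
df["country"] = df["country"].astype("category")
category_bytes = df["country"].memory_usage(deep=True)

# vectorized arithmetic runs in one NumPy call rather than a Python loop
df["amount_net"] = df["amount"] * 0.81

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

`memory_usage(deep=True)` is needed to count the string payloads; without `deep` the object column reports only pointer sizes.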
Level 3

| Role | Required | Description |
|---|---|---|
| Analytics Engineer | Required | Architects Python pipelines for data that is difficult to process with pure SQL: NLP text processing, geocoding, complex regex parsing. Optimizes pandas/polars for processing millions of rows: chunked reading, lazy evaluation in polars. |
| BI Analyst | Required | Designs data processing architecture with Pandas/Polars for enterprise BI: automated ETL pipelines, data quality frameworks, and real-time analytics data preparation. Optimizes large-scale data transformations with Polars lazy evaluation and partitioned processing. Implements data governance practices including lineage tracking and schema validation. Mentors team on efficient data engineering patterns. |
| Computer Vision Engineer | Required | Designs data management architecture for CV systems with Pandas/Polars: automated dataset versioning, cross-dataset analysis pipelines, and model performance tracking data infrastructure. Implements efficient annotation management for million-scale datasets. Creates data quality frameworks for training data validation. Mentors team on efficient data processing for CV workflows. |
| Data Analyst | Required | Designs data processing architecture for analytical platforms: distributed data pipelines integrating Pandas/Polars with Spark/Dask, automated data quality monitoring, and self-service analytics data preparation. Implements data governance with schema evolution and backward compatibility. Creates organizational data transformation standards and a reusable library. Mentors team on performance optimization. |
| Data Engineer | Required | Designs data processing: Polars for single-node high-performance, pandas for quick prototyping, PySpark for distributed. Selects tool by volume and processing pattern. Optimizes memory management. |
| Data Scientist | Required | Designs ML data architecture with Pandas/Polars: feature store integration, automated feature validation pipelines, and efficient data loading for distributed training. Implements data quality frameworks for training data integrity. Creates organization-wide feature engineering libraries and standards. Mentors team on scalable data processing patterns for ML. |
| LLM Engineer | Required | Designs LLM data architecture with Pandas/Polars: training data pipeline infrastructure, evaluation benchmark management, and dataset versioning for model reproducibility. Implements data quality frameworks for detecting contamination, bias, and distribution shift. Creates organization-wide data preparation standards for LLM fine-tuning. Mentors team on scalable text processing. |
| ML Engineer | Required | Designs data processing pipelines for ML. Chooses pandas vs Polars vs Spark for different scales. Optimizes memory usage for large datasets. Writes reusable feature transformers. |
| NLP Engineer | Required | Designs efficient NLP data pipelines with pandas/Polars. Optimizes large text corpus processing, applies partitioning, chunked processing for out-of-memory datasets. |
Level 4

| Role | Required | Description |
|---|---|---|
| Analytics Engineer | Required | Defines standards for Python vs SQL usage on the analytics platform: when pandas/polars is justified over dbt, templates for Python models in dbt. Implements best practices for reproducible data preparation notebooks. |
| BI Analyst | Required | Defines data engineering strategy for BI organization. Shapes data platform architecture: tool selection (Pandas vs Polars vs Spark), data pipeline standards, and data quality governance. Coordinates data teams on shared transformation libraries and best practices. Optimizes data mesh/data fabric approaches for self-service analytics. |
| Computer Vision Engineer | Required | Defines data strategy for CV engineering teams. Shapes data platform for computer vision: dataset management tools, annotation pipeline standards, and data quality governance for training data. Coordinates teams on data sharing practices and cross-project dataset reuse. Drives adoption of efficient data processing tools. |
| Data Analyst | Required | Defines data engineering strategy for analytics teams. Shapes data platform: tool selection and standards for data transformation, pipeline orchestration, and quality governance. Coordinates analytics teams on shared data assets and best practices. Drives adoption of modern data processing frameworks (Polars, DuckDB) for analytical workloads. |
| Data Engineer | Required | Defines data processing standards: when pandas/Polars vs Spark, coding guidelines, testing patterns. Implements benchmarking for tool selection. Trains team on Polars adoption. |
| Data Scientist | Required | Defines data engineering strategy for ML teams. Shapes ML data platform: feature store architecture, data pipeline standards, and training data governance. Coordinates ML teams on shared feature engineering libraries and data quality practices. Drives adoption of modern data processing tools (Polars, Ray) for ML workloads. |
| LLM Engineer | Required | Defines data engineering strategy for LLM teams. Establishes the supporting data platform. Coordinates data teams on shared practices. Optimizes data mesh/data fabric approaches. |
| ML Engineer | Required | Defines data processing standards for ML team. Creates feature engineering framework. Trains the team on efficient data handling. |
| NLP Engineer | Required | Defines data processing standards for the NLP team. Establishes best practices for pandas/Polars usage, defines data processing patterns, and trains the team on optimization. |
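The Data Engineer row above calls for benchmarking to back tool-selection standards. A minimal sketch of such a micro-benchmark on invented data, here comparing iterrows with its vectorized equivalent; the row count and repeat count are arbitrary choices:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(10_000)})

def looped():
    total = 0.0
    for _, row in df.iterrows():   # Python-level loop, builds one Series per row
        total += row["x"] * 2
    return total

def vectorized():
    return (df["x"] * 2).sum()     # single NumPy operation under the hood

t_loop = timeit.timeit(looped, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f"iterrows: {t_loop:.3f}s  vectorized: {t_vec:.3f}s")
```

A real benchmarking standard would pin data sizes, warm-up runs, and hardware; the point here is only that the comparison is a few lines, so it is cheap to require numbers before choosing a tool.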
Level 5 (Principal)

| Role | Required | Description |
|---|---|---|
| Analytics Engineer | Required | Architects the transformation tool selection strategy: dbt (SQL) as primary, Python models for ML feature engineering and complex logic. Defines the integration architecture for pandas/polars/PySpark with dbt for hybrid pipelines. |
| BI Analyst | Required | Defines organizational data strategy for business intelligence: enterprise data platform design, data governance framework, and self-service analytics vision. Evaluates emerging data technologies for BI transformation. Drives data-driven culture adoption across the organization. Shapes data literacy and tooling standards at enterprise level. |
| Computer Vision Engineer | Required | Defines organizational data strategy for AI/CV: enterprise dataset management platform, data governance for ML training data, and cross-team data sharing architecture. Evaluates emerging data technologies for CV workloads. Drives adoption of efficient data processing practices across CV teams. |
| Data Analyst | Required | Defines organizational data strategy: enterprise data platform architecture, data governance framework, and data democratization vision. Evaluates emerging data technologies and processing frameworks. Drives data-driven culture and data literacy across the organization. Shapes industry practices through thought leadership in data engineering. |
| Data Engineer | Required | Defines local data processing strategy: DuckDB for ad-hoc analytics, Polars for batch ETL, Arrow for zero-copy data exchange. Designs unified API for different backends. |
| Data Scientist | Required | Defines organizational data strategy for ML/AI: enterprise ML data platform, feature store architecture, and data governance for responsible AI. Evaluates emerging data technologies for ML workloads at scale. Drives data engineering excellence across data science teams. Shapes organizational data culture and practices for AI-ready data infrastructure. |
| LLM Engineer | Required | Defines organizational data strategy for LLM systems. Designs enterprise data platforms. Establishes data governance frameworks. |
| ML Engineer | Required | Defines data processing strategy for ML platform. Evaluates novel data processing frameworks. Designs unified data API for ML. |
| NLP Engineer | Required | Shapes enterprise text data processing strategy at organizational level. Defines data processing standards, tool selection, and data pipeline architecture for the NLP platform. |
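A recurring Principal-level theme above is a unified API that selects a backend (pandas, Polars, Spark) by workload. A sketch of the idea under stated assumptions: `read_table` and `POLARS_THRESHOLD_BYTES` are hypothetical names invented here, the size cutoff is arbitrary, and Polars is used only if it happens to be installed:

```python
import os

import pandas as pd

try:  # Polars is optional in this sketch; fall back to pandas when absent
    import polars as pl
except ImportError:
    pl = None

# Hypothetical cutoff above which the faster single-node backend is preferred.
POLARS_THRESHOLD_BYTES = 100 * 1024 * 1024

def read_table(path: str):
    """Unified entry point: pick a backend by file size (illustration only)."""
    use_polars = pl is not None and os.path.getsize(path) > POLARS_THRESHOLD_BYTES
    if use_polars:
        return pl.read_csv(path)
    return pd.read_csv(path)
```

A production version would dispatch on more than file size (format, partitioning, cluster availability) and return a common interface rather than two different DataFrame types; Arrow, mentioned in the Data Engineer row, is the usual zero-copy bridge between such backends.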