Top 23 Python data-engineering Projects

Airflow

1 205 45,795 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: From Zero to Job Data Visualization vs Power BI: Which Wins? | dev.to | 2026-05-07

For senior engineers building custom job data visualization pipelines, the single biggest latency gain comes from pre-aggregating frequently accessed metrics instead of running joins at query time. In our benchmarks, querying raw job_postings tables with 1M rows took 210ms average, while pre-aggregated tables (updated hourly via PostgreSQL materialized views) reduced query time to 12ms. Use tools like Apache Airflow 2.7.3 to schedule materialized view refreshes during off-peak hours. For example, a materialized view for average salary by company can be defined as:
Prefect

2 21 22,636 10.0 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: volnux VS Prefect - a user suggested alternative | libhunt.com/r/volnux | 2025-11-19
airbyte

3 160 21,487 10.0 Python

Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.

Project mention: Show HN: Airbyte Agents – context for agents across multiple data sources | news.ycombinator.com | 2026-05-05
Taipy

4 24 19,244 9.0 Python

Turns Data and AI algorithms into production-ready web applications in no time.
dagster

5 62 15,710 10.0 Python

An orchestration platform for the development, production, and observation of data assets.

Project mention: Ten years late to the dbt party (DuckDB edition) | dev.to | 2026-02-23

I used Dagster, which integrates with dbt nicely (see the point above about how it automagically pulls in documentation). It understands the models and dependencies, and orchestrates everything nicely. It tracks executions and shows you runtimes.
Cookbook

6 21 15,142 2.3 Python

The Data Engineering Cookbook (by andkret)
great_expectations

7 16 11,556 9.8 Python

Always know what to expect from your data.

Project mention: validatelite VS great_expectations - a user suggested alternative | libhunt.com/r/validatelite | 2025-08-08

Great Expectations is a popular open-source data validation framework with rich features and integrations, but it has a steeper learning curve and heavier setup. ValidateLite offers a lightweight, zero-config CLI alternative for quick checks and automation.
xonsh

8 133 9,518 9.9 Python

🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.

Project mention: Xonsh shell 0.23 REFORGED – not just a release | news.ycombinator.com | 2026-04-21
Mage

9 80 8,751 9.2 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: The Data Awakens: My First Pipeline with Mage AI | dev.to | 2025-09-11

That’s where Mage AI stood out. From the very first attempt to run it, it feels really easy and straightforward.
feast

10 14 7,089 9.7 Python

The Open Source Feature Store for AI/ML
dlt

11 11 5,460 9.7 Python

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

Project mention: ggsql: A Grammar of Graphics for SQL | news.ycombinator.com | 2026-04-20
AWS Data Wrangler

12 9 4,107 8.5 Python

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
meltano

13 9 2,538 9.8 Python

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
soda-core

14 6 2,372 8.8 Python

Data Contracts engine for the modern data stack. https://www.soda.io

Project mention: Show HN: Data contracts engine for the modern data stack | news.ycombinator.com | 2026-01-28
TrustGraph

15 11 2,187 9.7 Python

Write context once. Run agents anywhere. Discover the power of holons and dramatically reduce your token usage.

Project mention: The Context Graph Manifesto | dev.to | 2025-12-31

When Mark Adams and I (Daniel Davis) began working on what has become TrustGraph over 2 years ago, we knew that graph structures would be instrumental in realizing the potential of AI technology, specifically LLMs.
pyspark-example-project

16 1 2,087 0.0 Python

Implementing best practices for PySpark ETL jobs and applications.
bytewax

17 21 1,964 9.2 Python

Python Stream Processing

Project mention: Bytewax: Stream processing library built using Python and Rust | news.ycombinator.com | 2026-05-22
Udacity-Data-Engineering-Projects

18 5 1,907 0.0 Python

Few projects related to Data Engineering including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
mlrun

19 3 1,672 9.9 Python

MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
quix-streams

20 26 1,555 8.9 Python

A Python library for building containerized ML and Generative AI applications with Apache Kafka.
pyper

21 2 1,525 9.0 Python

Concurrent Python made simple
pyjanitor

22 4 1,497 9.5 Python

Clean APIs for data cleaning. Python implementation of R package Janitor
DataEngineeringProject

23 5 1,411 0.0 Python

Example end to end data engineering project.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python data-engineering related posts

PII masking in Polars: MaskOps 2.0, and two metrics that lied to me

1 project | dev.to | 10 Jun 2026
AI Enrichment Pipeline: From Sample Classification to 100K-File Metadata Search with Bedrock and OpenSearch NextGen

1 project | dev.to | 8 Jun 2026
From Hours to Seconds: An AI-Powered Metadata Catalog for Unstructured Data on FSx for ONTAP

1 project | dev.to | 8 Jun 2026
Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

1 project | dev.to | 26 May 2026
Bytewax: Stream processing library built using Python and Rust

1 project | news.ycombinator.com | 22 May 2026
Redun: Yet another redundant workflow engine

1 project | news.ycombinator.com | 14 May 2026
Why every data quality tool tells you what broke — but leaves you alone to figure out why

1 project | dev.to | 12 May 2026

Index

What are some of the best open-source data-engineering projects in Python? This list will help you:

#	Project	Stars
1	Airflow	45,795
2	Prefect	22,636
3	airbyte	21,487
4	Taipy	19,244
5	dagster	15,710
6	Cookbook	15,142
7	great_expectations	11,556
8	xonsh	9,518
9	Mage	8,751
10	feast	7,089
11	dlt	5,460
12	AWS Data Wrangler	4,107
13	meltano	2,538
14	soda-core	2,372
15	TrustGraph	2,187
16	pyspark-example-project	2,087
17	bytewax	1,964
18	Udacity-Data-Engineering-Projects	1,907
19	mlrun	1,672
20	quix-streams	1,555
21	pyper	1,525
22	pyjanitor	1,497
23	DataEngineeringProject	1,411
