Top 23 Python data-engineering Projects
-
For senior engineers building custom job data visualization pipelines, the single biggest latency gain comes from pre-aggregating frequently accessed metrics instead of running joins at query time. In our benchmarks, querying a raw job_postings table with 1M rows took 210 ms on average, while pre-aggregated tables (refreshed hourly as PostgreSQL materialized views) reduced query time to 12 ms. A scheduler such as Apache Airflow 2.7.3 can run the materialized view refreshes during off-peak hours. One example is a materialized view of average salary by company.
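The view definition itself is not included in the source, so what follows is only a minimal sketch of what an average-salary-by-company materialized view might look like. The `company` and `salary` columns, the `avg_salary_by_company` view name, and the helper functions are assumptions; only the `job_postings` table name comes from the text. The PostgreSQL statements are kept as strings, and the same pre-aggregation idea is emulated with an in-memory SQLite summary table so the example runs without a database server.

```python
import sqlite3

# PostgreSQL DDL for the kind of view described above. The columns `company`
# and `salary` are assumed; only the table name `job_postings` is from the text.
CREATE_MV_SQL = """
CREATE MATERIALIZED VIEW avg_salary_by_company AS
SELECT company, AVG(salary) AS avg_salary, COUNT(*) AS posting_count
FROM job_postings
GROUP BY company;
"""

# Statement a scheduler would run on each refresh.
REFRESH_MV_SQL = "REFRESH MATERIALIZED VIEW avg_salary_by_company;"


def refresh_summary(conn):
    """Rebuild the pre-aggregated table.

    SQLite has no materialized views, so this drop-and-recreate step stands in
    for REFRESH MATERIALIZED VIEW in this emulation.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS avg_salary_by_company")
        conn.execute(
            "CREATE TABLE avg_salary_by_company AS "
            "SELECT company, AVG(salary) AS avg_salary, COUNT(*) AS posting_count "
            "FROM job_postings GROUP BY company"
        )


def demo():
    """Load a few made-up rows, refresh the summary, and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job_postings (company TEXT, salary REAL)")
    conn.executemany(
        "INSERT INTO job_postings VALUES (?, ?)",
        [("Acme", 100000.0), ("Acme", 120000.0), ("Globex", 90000.0)],
    )
    refresh_summary(conn)
    rows = conn.execute(
        "SELECT company, avg_salary, posting_count "
        "FROM avg_salary_by_company ORDER BY company"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Dashboards then read from `avg_salary_by_company` instead of aggregating `job_postings` on every request. In PostgreSQL, the scheduled job would execute `REFRESH_MV_SQL` rather than dropping and recreating a table.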
-
Project mention: volnux VS Prefect - a user suggested alternative | libhunt.com/r/volnux | 2025-11-19
-
airbyte
Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.
Project mention: Show HN: Airbyte Agents – context for agents across multiple data sources | news.ycombinator.com | 2026-05-05
-
I used Dagster, which integrates with dbt nicely (see the point above about how it automagically pulls in documentation). It understands the models and dependencies, and orchestrates everything nicely. It tracks executions and shows you runtimes.
-
Project mention: validatelite VS great_expectations - a user suggested alternative | libhunt.com/r/validatelite | 2025-08-08
Great Expectations is a popular open-source data validation framework with rich features and integrations, but it has a steeper learning curve and heavier setup. ValidateLite offers a lightweight, zero-config CLI alternative for quick checks and automation.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
That’s where Mage AI stood out. From the very first attempt to run it, it feels really easy and straightforward.
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
-
Project mention: Show HN: Data contracts engine for the modern data stack | news.ycombinator.com | 2026-01-28
-
TrustGraph
Write context once. Run agents anywhere. Discover the power of holons and dramatically reduce your token usage.
When Mark Adams and I (Daniel Davis) began working on what has become TrustGraph over 2 years ago, we knew that graph structures would be instrumental in realizing the potential of AI technology, specifically LLMs.
-
Project mention: Bytewax: Stream processing library built using Python and Rust | news.ycombinator.com | 2026-05-22
-
Udacity-Data-Engineering-Projects
A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
-
mlrun
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
-
quix-streams
A Python library for building containerized ML and Generative AI applications with Apache Kafka.
-
Python data-engineering discussion
Python data-engineering related posts
-
PII masking in Polars: MaskOps 2.0, and two metrics that lied to me
-
AI Enrichment Pipeline: From Sample Classification to 100K-File Metadata Search with Bedrock and OpenSearch NextGen
-
From Hours to Seconds: An AI-Powered Metadata Catalog for Unstructured Data on FSx for ONTAP
-
Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy
-
Bytewax: Stream processing library built using Python and Rust
-
Redun: Yet another redundant workflow engine
-
Why every data quality tool tells you what broke — but leaves you alone to figure out why
-
A note from our sponsor - SaaSHub
www.saashub.com | 19 Jun 2026
Index
What are some of the best open-source data-engineering projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Airflow | 45,795 |
| 2 | Prefect | 22,636 |
| 3 | airbyte | 21,487 |
| 4 | Taipy | 19,244 |
| 5 | dagster | 15,710 |
| 6 | Cookbook | 15,142 |
| 7 | great_expectations | 11,556 |
| 8 | xonsh | 9,518 |
| 9 | Mage | 8,751 |
| 10 | feast | 7,089 |
| 11 | dlt | 5,460 |
| 12 | AWS Data Wrangler | 4,107 |
| 13 | meltano | 2,538 |
| 14 | soda-core | 2,372 |
| 15 | TrustGraph | 2,187 |
| 16 | pyspark-example-project | 2,087 |
| 17 | bytewax | 1,964 |
| 18 | Udacity-Data-Engineering-Projects | 1,907 |
| 19 | mlrun | 1,672 |
| 20 | quix-streams | 1,555 |
| 21 | pyper | 1,525 |
| 22 | pyjanitor | 1,497 |
| 23 | DataEngineeringProject | 1,411 |