Python data-engineering

Open-source Python projects categorized as data-engineering

Top 23 Python data-engineering Projects

data-engineering
  1. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: From Zero to Job Data Visualization vs Power BI: Which Wins? | dev.to | 2026-05-07

    For senior engineers building custom job data visualization pipelines, the single biggest latency gain comes from pre-aggregating frequently accessed metrics instead of running joins at query time. In our benchmarks, querying raw job_postings tables with 1M rows took 210ms average, while pre-aggregated tables (updated hourly via PostgreSQL materialized views) reduced query time to 12ms. Use tools like Apache Airflow 2.7.3 to schedule materialized view refreshes during off-peak hours. For example, a materialized view for average salary by company can be defined as:

  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  3. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: volnux VS Prefect - a user suggested alternative | libhunt.com/r/volnux | 2025-11-19
  4. airbyte

    Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.

    Project mention: Show HN: Airbyte Agents – context for agents across multiple data sources | news.ycombinator.com | 2026-05-05
  5. Taipy

    Turns Data and AI algorithms into production-ready web applications in no time.

  6. dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Ten years late to the dbt party (DuckDB edition) | dev.to | 2026-02-23

    I used Dagster, which integrates with dbt nicely (see the point above about how it automagically pulls in documentation). It understands the models and dependencies, and orchestrates everything nicely. It tracks executions and shows you runtimes.

  7. Cookbook

    The Data Engineering Cookbook (by andkret)

  8. great_expectations

    Always know what to expect from your data.

    Project mention: validatelite VS great_expectations - a user suggested alternative | libhunt.com/r/validatelite | 2025-08-08

    Great Expectations is a popular open-source data validation framework with rich features and integrations, but it has a steeper learning curve and heavier setup. ValidateLite offers a lightweight, zero-config CLI alternative for quick checks and automation.

  9. xonsh

    🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.

    Project mention: Xonsh shell 0.23 REFORGED – not just a release | news.ycombinator.com | 2026-04-21
  10. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: The Data Awakens: My First Pipeline with Mage AI | dev.to | 2025-09-11

    That’s where Mage AI stood out. From the very first try to run it, it feels really easy and straightforward.

  11. feast

    The Open Source Feature Store for AI/ML

  12. dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    Project mention: ggsql: A Grammar of Graphics for SQL | news.ycombinator.com | 2026-04-20
  13. AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

  14. meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

  15. soda-core

    Data Contracts engine for the modern data stack. https://www.soda.io

    Project mention: Show HN: Data contracts engine for the modern data stack | news.ycombinator.com | 2026-01-28
  16. TrustGraph

    Write context once. Run agents anywhere. Discover the power of holons and dramatically reduce your token usage.

    Project mention: The Context Graph Manifesto | dev.to | 2025-12-31

    When Mark Adams and I (Daniel Davis) began working on what has become TrustGraph over 2 years ago, we knew that graph structures would be instrumental in realizing the potential of AI technology, specifically LLMs.

  17. pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  18. bytewax

    Python Stream Processing

    Project mention: Bytewax: Stream processing library built using Python and Rust | news.ycombinator.com | 2026-05-22
  19. Udacity-Data-Engineering-Projects

    Few projects related to Data Engineering including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

  20. mlrun

    MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

  21. quix-streams

    A Python library for building containerized ML and Generative AI applications with Apache Kafka.

  22. pyper

    Concurrent Python made simple

  23. pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

  24. DataEngineeringProject

    Example end to end data engineering project.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python data-engineering related posts

  • PII masking in Polars: MaskOps 2.0, and two metrics that lied to me

    1 project | dev.to | 10 Jun 2026
  • AI Enrichment Pipeline: From Sample Classification to 100K-File Metadata Search with Bedrock and OpenSearch NextGen

    1 project | dev.to | 8 Jun 2026
  • From Hours to Seconds: An AI-Powered Metadata Catalog for Unstructured Data on FSx for ONTAP

    1 project | dev.to | 8 Jun 2026
  • Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

    1 project | dev.to | 26 May 2026
  • Bytewax: Stream processing library built using Python and Rust

    1 project | news.ycombinator.com | 22 May 2026
  • Redun: Yet another redundant workflow engine

    1 project | news.ycombinator.com | 14 May 2026
  • Why every data quality tool tells you what broke — but leaves you alone to figure out why

    1 project | dev.to | 12 May 2026

Index

What are some of the best open-source data-engineering projects in Python? This list will help you:

#   Project                             Stars
1   Airflow                            45,795
2   Prefect                            22,636
3   airbyte                            21,487
4   Taipy                              19,244
5   dagster                            15,710
6   Cookbook                           15,142
7   great_expectations                 11,556
8   xonsh                               9,518
9   Mage                                8,751
10  feast                               7,089
11  dlt                                 5,460
12  AWS Data Wrangler                   4,107
13  meltano                             2,538
14  soda-core                           2,372
15  TrustGraph                          2,187
16  pyspark-example-project             2,087
17  bytewax                             1,964
18  Udacity-Data-Engineering-Projects   1,907
19  mlrun                               1,672
20  quix-streams                        1,555
21  pyper                               1,525
22  pyjanitor                           1,497
23  DataEngineeringProject              1,411
