
Python Data

Open-source Python projects categorized as Data

Top 23 Python Data Projects

  1. 30-Days-Of-Python

    The 30 Days of Python programming challenge is a step-by-step guide to learn the Python programming language in 30 days. This challenge may take more than 100 days. Follow your own pace. These videos may help too: https://www.youtube.com/channel/UC7PNRuno1rzYPb1xLa4yktw

    Project mention: Free Python Resources | dev.to | 2026-01-16

    A free, beginner‑friendly programming challenge on GitHub, 30 Days of Python breaks down the process of learning Python into daily lessons and exercises that guide you step by step from the basics to more advanced concepts.

  3. Scrapling

    🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

    Project mention: Launch HN: Intuned (YC S22) – Build and run reliable browser automations as code | news.ycombinator.com | 2026-06-08

    What is the advantage of your product over having Codex generate a script using something like https://github.com/D4Vinci/Scrapling?

  4. llama_index

    LlamaIndex is the leading document agent and OCR platform

    Project mention: Wait... FDE Is Not a JavaScript Framework? | dev.to | 2026-06-09

    LLM / agentic frameworks: LangChain, LlamaIndex, LangGraph, AutoGen, MCP, RAG. (Fiddler and Razorpay both list these. "Hands-on counts, not just awareness," as Razorpay puts it.)

  5. pandas-ai

    Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

    Project mention: 📰 All Data and AI Weekly #231-02March2026 | dev.to | 2026-03-02

    Pandas-AI: Talk to your dataframes in natural language.

  6. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: volnux VS Prefect - a user suggested alternative | libhunt.com/r/volnux | 2025-11-19
  7. airbyte

    Open-source data movement for ELT pipelines and AI agents — from APIs, databases & files to warehouses, lakes, and AI applications. Both self-hosted and Cloud.

    Project mention: Show HN: Airbyte Agents – context for agents across multiple data sources | news.ycombinator.com | 2026-05-05
  8. akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source library of financial data interfaces (by akfamily)

  9. chinese-xinhua

    Chinese Xinhua Dictionary database. Includes xiehouyu (two-part allegorical sayings), chengyu (idioms), words, and Chinese characters.

  10. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: The Data Awakens: My First Pipeline with Mage AI | dev.to | 2025-09-11

    That’s where Mage AI stood out. From the very first attempt to run it, it feels really easy and straightforward.

  11. knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

  12. dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    Project mention: ggsql: A Grammar of Graphics for SQL | news.ycombinator.com | 2026-04-20
  13. superduper

    Superduper: End-to-end framework for building custom AI applications and agents.

  14. CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

    Project mention: CKAN – an open-source DMS (data management system) | news.ycombinator.com | 2026-03-02
  15. Mimesis

    Mimesis is a fast Python library for generating fake data in multiple languages.

  16. DataFlow

    Easy Data Preparation with latest LLMs-based Operators and Pipelines.

    Project mention: Show HN: DataFlow,Turn raw data into high-quality LLM training datasets | news.ycombinator.com | 2026-03-16
  17. datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  18. preswald

    Preswald is a WASM packager for Python-based interactive data apps: bundle full complex data workflows, particularly visualizations, into single files, runnable completely in-browser, using Pyodide, DuckDB, Pandas, Plotly, Matplotlib, etc. Build dashboards, reports, and notebooks that run offline, load fast, and share like a document.

  19. LazyLLM

    Easiest and laziest way for building multi-agent LLMs applications.

    Project mention: LazyLLM, Easiest and laziest way for building multi-agent LLMs applications | news.ycombinator.com | 2025-11-05
  20. docetl

    A system for agentic LLM-powered data processing and ETL

  21. TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  22. pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

  23. PyPika

    PyPika is a python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

  24. Colour

    Colour Science for Python

    Project mention: Rendering the Visible Spectrum | news.ycombinator.com | 2026-02-18

    If you are interested in this topic, we have a fully featured colour science Python package that can of course render the visible spectrum: https://github.com/colour-science/colour?tab=readme-ov-file#...

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Data related posts

  • Automatically Validate Python Packages

    1 project | news.ycombinator.com | 17 Jun 2026
  • CocoIndex Review: Incremental RAG Engine for AI Agents

    2 projects | dev.to | 12 May 2026
  • Pydantic vs msgspec vs validatedata: Why Your Validation Library Slows Down on Bad Data

    1 project | dev.to | 11 May 2026
  • Anthropic's 10 Finance Agents: A Buyer's Guide for Banks

    3 projects | dev.to | 5 May 2026
  • Show HN: Airbyte Agents – context for agents across multiple data sources

    2 projects | news.ycombinator.com | 5 May 2026
  • Deep Dive into LlamaIndex's RAG Pipeline and Pinecone Vector Database Integration

    2 projects | dev.to | 4 May 2026
  • Implementing a Centralized Metrics Layer with dbt and a Semantic Layer

    1 project | dev.to | 21 Apr 2026

Index

What are some of the best open-source Data projects in Python? This list will help you:

# Project Stars
1 30-Days-Of-Python 64,710
2 Scrapling 64,713
3 llama_index 50,089
4 pandas-ai 23,587
5 Prefect 22,636
6 airbyte 21,487
7 akshare 20,305
8 chinese-xinhua 11,504
9 Mage 8,749
10 knowledge-repo 5,533
11 dlt 5,460
12 superduper 5,289
13 CKAN 5,053
14 Mimesis 4,813
15 DataFlow 5,047
16 datasets 4,570
17 preswald 4,284
18 LazyLLM 3,842
19 docetl 3,836
20 TextRecognitionDataGenerator 3,660
21 pandas-datareader 3,181
22 PyPika 2,917
23 Colour 2,600
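
The list is described as ordered by GitHub stars. As a rough illustration of what that ordering means, the sketch below parses a few rows copied from the table above, ranks them by star count, and reports adjacent rows whose printed order does not match their counts. The helper names (`parse_stars`, `rank_by_stars`, `out_of_order`) are hypothetical and not part of any project listed here; the star counts are taken as printed and may simply reflect data captured at different times.

```python
# A few rows copied from the index table: (project, stars as printed).
rows = [
    ("30-Days-Of-Python", "64,710"),
    ("Scrapling", "64,713"),
    ("llama_index", "50,089"),
    ("Mimesis", "4,813"),
    ("DataFlow", "5,047"),
]


def parse_stars(text: str) -> int:
    """Convert a printed count such as '64,710' to an int."""
    return int(text.replace(",", ""))


def rank_by_stars(rows):
    """Return (project, stars) pairs sorted by stars, highest first."""
    parsed = [(name, parse_stars(stars)) for name, stars in rows]
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)


def out_of_order(rows):
    """Return adjacent (earlier, later) names where the later row has more stars."""
    counts = [(name, parse_stars(stars)) for name, stars in rows]
    return [(a[0], b[0]) for a, b in zip(counts, counts[1:]) if b[1] > a[1]]


ranked = rank_by_stars(rows)
for position, (name, stars) in enumerate(ranked, start=1):
    print(position, name, f"{stars:,}")

print("Printed out of order:", out_of_order(rows))
```

Only a subset of rows is included to keep the sketch short; extending `rows` with the remaining table entries works the same way.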
