Python Text processing

Open-source Python projects categorized as Text processing

Top 23 Python Text processing Projects

Text processing
  1. markitdown

    Python tool for converting files and office documents to Markdown.

    Project mention: Python tool for converting files and office documents to Markdown | news.ycombinator.com | 2026-06-11
  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  3. docling

    Get your documents ready for gen AI

    Project mention: The Best PDF to Markdown Tools in 2026 (Honestly Compared) | dev.to | 2026-06-10

    IBM's open-source Docling focuses on document understanding: clean structure, solid tables, and exports designed to feed downstream AI pipelines. If your endpoint is a vector database rather than a human reader, it's a strong fit.

  4. mem0

    Universal memory layer for AI Agents

    Project mention: AI Builder Notes - Week of June 8, 2026 | dev.to | 2026-06-08

    Agents need to compress work into state. [10] Mem0 positions memory inside the harness alongside tools and coordination. [11] [17]

  5. pydantic

    Data validation using Python type hints

    Project mention: pydantic VS ctxure - a user suggested alternative | libhunt.com/r/pydantic | 2026-06-05
  6. rendercv

    Resume builder for academics and engineers

    Project mention: Show HN: RenderCV – Open-source CV/resume generator, YAML → PDF | news.ycombinator.com | 2025-12-21

    This is clearly a real project that was built over several years with human effort (not vibe coded). Which makes it all the more depressing that the author decided to take a massive dump over the entire README.md with AI slop.

    Sadly, it appears the project was heavily sloppified a mere 2 weeks ago: https://github.com/rendercv/rendercv/commit/5cc5fbdf9ec1a742...

  7. PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  8. Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

    Project mention: Comp Language Syntax | dev.to | 2026-03-30

    I've been using the Lark library to handle parsing. I'd never experimented with it before, but now have a good deal of experience stretching its LALR rules. This gives an ideal O(n) performance. I have been surprised at the cost of building the grammar at runtime. Fortunately the library is quite prepared for this and comes with some high-level caching options.

  9. 汉字拼音转换工具(Python 版) (Chinese character to pinyin conversion tool, Python version)

    Converts Chinese characters to pinyin (pypinyin)

  10. ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

    Project mention: Fix mojibake in Unicode text, after the fact | news.ycombinator.com | 2026-05-07
  11. sqlparse

    A non-validating SQL parser module for Python

  12. phonenumbers

    Python port of Google's libphonenumber

  13. TextDistance

    📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

  14. chardet

    Python character encoding detector

    Project mention: Grit: Rewriting Git in Rust with Agents | news.ycombinator.com | 2026-06-09
  15. pyparsing

    Python library for creating PEG parsers

  16. shortuuid

    A generator library for concise, unambiguous and URL-safe UUIDs.

  17. typeguard

    Run-time type checker for Python

    Project mention: Steel Bank Common Lisp | news.ycombinator.com | 2026-02-24

    >For example, it cannot specialize an element type for lists.

    Yes, but that would be a CL violation (or an extension to provide via something else than DEFTYPE), since DEFTYPE's body can't be infinitely recursive; cf https://www.lispworks.com/documentation/HyperSpec/Body/m_def...

    >if you attempt to (declaim) every function, you will immediately see how vague and insufficient the types come out compared to even Python.

    Indeed, but it is 1) used by the compiler itself while CPython currently ignores annotations and 2) runtime and buildtime typing use the same semantics and syntax, so you don't need band-aids like https://github.com/agronholm/typeguard

    But yeah, CL's type system is lacking in many places. In order of practical advantages and difficulty to add (maybe): recursive DEFTYPE, typed HASH-TABLEs (I mean the keys and values), static typing of CLOS slots (invasive, like https://github.com/marcoheisig/fast-generic-functions), ..., parametric typing beyond ARRAYs.

  18. python-slugify

    Returns unicode slugs

  19. pyfiglet

    An implementation of figlet written in Python

    Project mention: Make ASCII Art Anywhere: Use Python’s `pyfiglet` in Ruby, .NET Core, and Node.js with Javonet | dev.to | 2025-08-21

    pyfiglet is a full Python port of the original FIGlet tool. It turns simple text into stylish ASCII banners using various fonts. It's perfect for:

  20. DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  21. python-user-agents

    A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

  22. mirascope

    The LLM Anti-Framework

    Project mention: Show HN: Mirascope – The LLM Anti-Framework | news.ycombinator.com | 2026-01-26
  23. hazm

    Persian NLP Toolkit

  24. pythainlp

    Thai natural language processing in Python

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Text processing related posts

  • Python tool for converting files and office documents to Markdown

    1 project | news.ycombinator.com | 11 Jun 2026
  • The Best PDF to Markdown Tools in 2026 (Honestly Compared)

    3 projects | dev.to | 10 Jun 2026
  • AI Builder Notes - Week of June 8, 2026

    4 projects | dev.to | 8 Jun 2026
  • A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)

    2 projects | dev.to | 26 May 2026
  • My Local RAG article went viral. The product it promoted sold 1 copy in 6 months.

    1 project | dev.to | 20 May 2026
  • Fix mojibake in Unicode text, after the fact

    1 project | news.ycombinator.com | 7 May 2026
  • MarkItDown vs Docling vs Marker: PDF to Markdown for LLMs

    3 projects | dev.to | 3 May 2026

Index

What are some of the best open-source Text processing projects in Python? This list will help you:

#   Project                       Stars
1   markitdown                    146,166
2   docling                       61,364
3   mem0                          58,292
4   pydantic                      27,952
5   rendercv                      16,840
6   PyMuPDF                       9,969
7   Lark                          5,901
8   汉字拼音转换工具(Python 版)   5,318
9   ftfy                          4,036
10  sqlparse                      4,003
11  phonenumbers                  3,747
12  TextDistance                  3,526
13  chardet                       2,633
14  pyparsing                     2,472
15  shortuuid                     2,188
16  typeguard                     1,763
17  python-slugify                1,616
18  pyfiglet                      1,562
19  DataProfiler                  1,557
20  python-user-agents            1,515
21  mirascope                     1,493
22  hazm                          1,397
23  pythainlp                     1,135
