Top 23 Python Text processing Projects
-
Project mention: Python tool for converting files and office documents to Markdown | news.ycombinator.com | 2026-06-11
-
IBM's open-source Docling focuses on document understanding: clean structure, solid tables, and exports designed to feed downstream AI pipelines. If your endpoint is a vector database rather than a human reader, it's a strong fit.
-
Agents need to compress work into state. [10] Mem0 positions memory inside the harness alongside tools and coordination. [11] [17]
-
Project mention: pydantic VS ctxure - a user suggested alternative | libhunt.com/r/pydantic | 2026-06-05
-
Project mention: Show HN: RenderCV – Open-source CV/resume generator, YAML → PDF | news.ycombinator.com | 2025-12-21
This is clearly a real project that was built over several years with human effort (not vibe coded). Which makes it all the more depressing that the author decided to take a massive dump over the entire README.md with AI slop.
Sadly, it appears the project was heavily sloppified a mere 2 weeks ago: https://github.com/rendercv/rendercv/commit/5cc5fbdf9ec1a742...
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
Lark
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
I've been using the Lark library to handle parsing. I'd never experimented with it before, but I now have a good deal of experience stretching its LALR rules. LALR parsing gives ideal O(n) performance. I have been surprised at the cost of building the grammar at runtime. Fortunately, the library is well prepared for this and comes with some high-level caching options.
-
TextDistance
📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
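
To make "distance between sequences" concrete, here is a minimal standard-library sketch of two classic measures, Levenshtein and Hamming, written behind one shared call signature. It does not use TextDistance and is not taken from its source; the function names `levenshtein` and `hamming` are illustrative assumptions, not the library's API.

```python
def levenshtein(a, b):
    """Edit distance: the minimum number of single-item insertions,
    deletions, and substitutions needed to turn sequence a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty sequence costs i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(levenshtein("kitten", "sitting"))  # → 3
print(hamming("karolin", "kathrin"))     # → 3
```

Both functions accept any indexable sequence (strings, lists, tuples), which is the kind of uniformity a common interface across many algorithms is meant to provide.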
-
>For example, it cannot specialize an element type for lists.
Yes, but that would be a CL violation (or an extension to provide via something other than DEFTYPE), since DEFTYPE's body can't be infinitely recursive; cf. https://www.lispworks.com/documentation/HyperSpec/Body/m_def...
>if you attempt to (declaim) every function, you will immediately see how vague and insufficient the types come out compared to even Python.
Indeed, but it is 1) used by the compiler itself, while CPython currently ignores annotations, and 2) runtime and build-time typing use the same semantics and syntax, so you don't need band-aids like https://github.com/agronholm/typeguard
But yeah, CL's type system is lacking in many places. In order of practical advantages and difficulty to add (maybe): recursive DEFTYPE, typed HASH-TABLEs (I mean the keys and values), static typing of CLOS slots (invasive, like https://github.com/marcoheisig/fast-generic-functions), ..., parametric typing beyond ARRAYs.
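
The remark that CPython ignores annotations at runtime can be illustrated with a short standard-library sketch. The `check_types` decorator below is a hypothetical, deliberately simplified stand-in for what a runtime checker does. It is not typeguard's API or implementation, and it only handles annotations that are plain classes such as `int` or `str`.

```python
import functools
import inspect
from typing import get_type_hints


def check_types(func):
    """Hypothetical decorator: enforce plain-class annotations at call time."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            if signature.parameters[name].kind in variadic:
                continue  # *args / **kwargs are out of scope for this sketch
            expected = hints.get(name)
            # Only plain classes are checked; generics like list[int] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        result = func(*args, **kwargs)
        expected = hints.get("return")
        if isinstance(expected, type) and not isinstance(result, expected):
            raise TypeError(
                f"return value must be {expected.__name__}, got {type(result).__name__}"
            )
        return result

    return wrapper


def add(x: int, y: int) -> int:
    return x + y


checked_add = check_types(add)

print(add("a", "b"))      # → ab  (the annotations are not enforced)
print(checked_add(1, 2))  # → 3
```

Calling `checked_add("a", "b")` raises `TypeError`, whereas the undecorated `add` accepts the strings silently. A real checker has to cover generics, unions, protocols, and more, which is the gap the comment calls a band-aid.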
-
Project mention: Make ASCII Art Anywhere: Use Python’s `pyfiglet` in Ruby, .NET Core, and Node.js with Javonet | dev.to | 2025-08-21
pyfiglet is a full Python port of the original FIGlet tool. It turns simple text into stylish ASCII banners using various fonts. It's perfect for:
-
python-user-agents
A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.
-
Python Text processing discussion
Python Text processing related posts
-
Python tool for converting files and office documents to Markdown
-
The Best PDF to Markdown Tools in 2026 (Honestly Compared)
-
AI Builder Notes - Week of June 8, 2026
-
A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)
-
My Local RAG article went viral. The product it promoted sold 1 copy in 6 months.
-
Fix mojibake in Unicode text, after the fact
-
MarkItDown vs Docling vs Marker: PDF to Markdown for LLMs
-
Index
What are some of the best open-source Text processing projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | markitdown | 146,166 |
| 2 | docling | 61,364 |
| 3 | mem0 | 58,292 |
| 4 | pydantic | 27,952 |
| 5 | rendercv | 16,840 |
| 6 | PyMuPDF | 9,969 |
| 7 | Lark | 5,901 |
| 8 | 汉字拼音转换工具(Python 版) | 5,318 |
| 9 | ftfy | 4,036 |
| 10 | sqlparse | 4,003 |
| 11 | phonenumbers | 3,747 |
| 12 | TextDistance | 3,526 |
| 13 | chardet | 2,633 |
| 14 | pyparsing | 2,472 |
| 15 | shortuuid | 2,188 |
| 16 | typeguard | 1,763 |
| 17 | python-slugify | 1,616 |
| 18 | pyfiglet | 1,562 |
| 19 | DataProfiler | 1,557 |
| 20 | python-user-agents | 1,515 |
| 21 | mirascope | 1,493 |
| 22 | hazm | 1,397 |
| 23 | pythainlp | 1,135 |