﻿<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI Exploration Journey]]></title><description><![CDATA[A passionate AI researcher who enjoys delving into underlying principles and writing in-depth articles.]]></description><link>https://aiexpjourney.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!5JuB!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c45940b-721b-4f3b-ad25-2541b18dcd88_599x599.png</url><title>AI Exploration Journey</title><link>https://aiexpjourney.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 21 Jun 2026 09:39:06 GMT</lastBuildDate><atom:link href="https://aiexpjourney.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Florian June]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aiexpjourney@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aiexpjourney@substack.com]]></itunes:email><itunes:name><![CDATA[Florian]]></itunes:name></itunes:owner><itunes:author><![CDATA[Florian]]></itunes:author><googleplay:owner><![CDATA[aiexpjourney@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aiexpjourney@substack.com]]></googleplay:email><googleplay:author><![CDATA[Florian]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Attention Is All You Need—for Retrieval Too? 
— AI Innovations and Insights 141]]></title><description><![CDATA[Wait, is external retrieval in RAG actually redundant?]]></description><link>https://aiexpjourney.substack.com/p/attention-is-all-you-needfor-retrieval</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/attention-is-all-you-needfor-retrieval</guid><pubDate>Fri, 19 Jun 2026 09:59:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EEYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccff5dca-0207-42e8-88d7-484e2fdcdb2a_3034x1480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Wait, is external retrieval in RAG actually redundant? </p><p>Let&#8217;s take a closer look at NVIDIA&#8217;s latest finding, one that challenges a core assumption about how RAG should work.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Overlooked Problem in RAG</h2><p>The typical architecture behind Retrieval-Augmented Generation (RAG):</p><ul><li><p>A retriever (such as BM25, BGE, or ColBERT) fetches relevant documents from a corpus.</p></li><li><p>A generator (usually an LLM) re-encodes these retrieved texts and then produces the final answer.</p></li></ul><p>This separation creates a persistent mismatch: the retriever and generator often operate in different representation spaces.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/attention-is-all-you-needfor-retrieval">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MinerU-Popo: Fixing the Hidden Flaw in PDF Parsing — AI Innovations and Insights 140]]></title><description><![CDATA[Modern OCR can produce impressive page-level results, but real documents do not think in pages. They think in chapters, sections, explanations, figures, tables, and long-running context.]]></description><link>https://aiexpjourney.substack.com/p/mineru-popo-fixing-the-hidden-flaw</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mineru-popo-fixing-the-hidden-flaw</guid><pubDate>Sun, 14 Jun 2026 01:48:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8Nli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Modern OCR can produce impressive page-level results, but real documents do not think in pages.</strong> They think in chapters, sections, explanations, figures, tables, and long-running context.</p><p>This post might offer an insightful perspective.</p>
<h2>Motivation</h2><p>Current Visual Language Models (VLM) and OCR tools already do a strong job extracting content from individual PDF pages.</p><p>However, they typically output structures limited to individual pages. What downstream tasks like RAG (Retrieval-Augmented Generation), document Q&amp;A, and automated analytics really need is a coherent, document-level structure that spans multiple pages.</p><p>Specifically, existing OCR methods struggle in three main areas:</p><ol><li><p>Cross-page Fragmentation: Paragraphs and tables often get split at page boundaries, and OCR systems struggle to recognize these separated parts as belonging to the same logical unit.</p></li><li><p>Structural Ambiguity: Recovering heading hierarchies and linking captions, images, tables, and governing sections requires reasoning over multiple OCR elements, often across page boundaries.</p></li><li><p>Cost and Scalability: Relying heavily on manual review or high-end VLMs for post-processing is expensive, making it impractical for large-scale projects, privacy-sensitive situations, or local deployments.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Nli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!8Nli!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 424w, https://substackcdn.com/image/fetch/$s_!8Nli!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 848w, https://substackcdn.com/image/fetch/$s_!8Nli!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 1272w, https://substackcdn.com/image/fetch/$s_!8Nli!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Nli!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png" width="1200" height="542.3076923076923" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:867958,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/200431519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8Nli!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 424w, https://substackcdn.com/image/fetch/$s_!8Nli!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 848w, https://substackcdn.com/image/fetch/$s_!8Nli!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 1272w, https://substackcdn.com/image/fetch/$s_!8Nli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ae348f-6c99-4404-b619-57f1eb95a106_1668x754.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: MinerU-Popo supports diverse OCR models through type-label alignment and bridges page-level parsing with document-level modeling via four post-processing subtasks. Applying MinerU-Popo after OCR yields clear improvements in title hierarchy analysis. [<a href="https://arxiv.org/pdf/2605.24973v1">Source</a>].</figcaption></figure></div><h2>MinerU-Popo: From Broken Pages to Smart Documents</h2><p>The core idea of MinerU-Popo is straightforward: add a lightweight &#8220;document structure fixer&#8221; after OCR. This step transforms the fragmented, page-level OCR output into a coherent, document-level structure.</p>
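<p>To make the idea concrete, here is a minimal sketch of one kind of repair such a fixer could perform: rejoining a paragraph that was cut at a page boundary. It is an illustrative heuristic only, not MinerU-Popo&#8217;s actual method; the function name <code>merge_cross_page</code>, the block fields (<code>type</code>, <code>page_idx</code>, <code>text</code>), and the punctuation rule are all assumptions made for this example.</p>

```python
from typing import Dict, List

# Assumed rule of thumb: a paragraph that ends with one of these is complete.
TERMINATORS = (".", "!", "?", ":", ";")


def merge_cross_page(blocks: List[Dict]) -> List[Dict]:
    """Join a text block cut at a page boundary with its continuation.

    `blocks` is a hypothetical page-level OCR output in reading order; each
    block carries `type`, `page_idx`, and (for text blocks) `text`.
    """
    merged: List[Dict] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["type"] == "text"
            and block["type"] == "text"
            # Only consider the first block of the very next page.
            and block["page_idx"] == prev["last_page"] + 1
            # The previous block stops without closing punctuation.
            and not prev["text"].rstrip().endswith(TERMINATORS)
        ):
            prev["text"] = prev["text"].rstrip() + " " + block["text"].lstrip()
            prev["last_page"] = block["page_idx"]
        else:
            # Copy the block and remember the last page it covers.
            merged.append({**block, "last_page": block["page_idx"]})
    return merged
```

<p>A rule like this is cheap but brittle (headings and list items often lack end punctuation), which is one reason simple rules alone rarely suffice for document-level structure.</p>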
      <p>
          <a href="https://aiexpjourney.substack.com/p/mineru-popo-fixing-the-hidden-flaw">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Unlocking PDFs for RAG: How RAG-Anything Handles Complex Documents — AI Innovations and Insights 139]]></title><description><![CDATA[PDF files are notoriously complex.]]></description><link>https://aiexpjourney.substack.com/p/unlocking-pdfs-for-rag-how-rag-anything</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/unlocking-pdfs-for-rag-how-rag-anything</guid><pubDate>Wed, 10 Jun 2026 00:32:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!--g9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>PDF files are notoriously complex. Rather than simple text documents, they&#8217;re more like containers for formatted content, filled with text layers, scanned images, intricate layouts, tables, equations, illustrations, and footnotes. 
Parsing PDFs effectively thus represents the pinnacle of document parsing.</p><p>For a RAG (Retrieval-Augmented Generation) system, <strong>the ultimate goal of PDF parsing</strong> is to extract structured content that&#8217;s searchable, referenceable, and analyzable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dvg_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dvg_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 424w, https://substackcdn.com/image/fetch/$s_!Dvg_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 848w, https://substackcdn.com/image/fetch/$s_!Dvg_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!Dvg_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dvg_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png" width="1200" height="521.7032967032967" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:633,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RAG-Anything&quot;,&quot;title&quot;:&quot;RAG-Anything&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="RAG-Anything" title="RAG-Anything" srcset="https://substackcdn.com/image/fetch/$s_!Dvg_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 424w, https://substackcdn.com/image/fetch/$s_!Dvg_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 848w, https://substackcdn.com/image/fetch/$s_!Dvg_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!Dvg_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc052d2c-e404-4946-a1b1-85c1c3c15df4_2498x1086.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: The overview of RAG-Anything. [<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/README.md">Source</a>].</figcaption></figure></div><p><strong>My interest in PDF parsing</strong> within RAG-Anything, a well-known open-source RAG project, led me to write this article.</p>
<h2>Key Concepts</h2><p>Let&#8217;s start with a few essential concepts.</p><p>A <strong>parser</strong> can be thought of as a &#8220;document disassembler.&#8221; It takes inputs like PDFs, images, or Office documents and breaks them down into data chunks suitable for subsequent system processing. In RAG-Anything&#8217;s code, a unified interface (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L68">raganything/parser.py:68</a>) defines key methods: <code>parse_pdf()</code>, <code>parse_image()</code>, <code>parse_document()</code>, and <code>check_installation()</code>.</p><p>The <strong>content_list</strong> is the standardized intermediate format after parsing. Think of it intuitively as a &#8220;table of contents&#8221; for document content: page 0 might contain text, page 1 an image, page 2 a table, and page 3 an equation. Technically, it&#8217;s represented as <code>List[Dict[str, Any]]</code>, with each item specifying its <code>type</code>, <code>page_idx</code>, and the actual content. 
Detailed structures for text, images, tables, and equations can be found in the README.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1S5v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1S5v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 424w, https://substackcdn.com/image/fetch/$s_!1S5v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 848w, https://substackcdn.com/image/fetch/$s_!1S5v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 1272w, https://substackcdn.com/image/fetch/$s_!1S5v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1S5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png" width="726.7109375" height="296.4741049965659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:594,&quot;width&quot;:1456,&quot;resizeWidth&quot;:726.7109375,&quot;bytes&quot;:273347,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/199555363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1S5v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 424w, https://substackcdn.com/image/fetch/$s_!1S5v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 848w, https://substackcdn.com/image/fetch/$s_!1S5v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 1272w, https://substackcdn.com/image/fetch/$s_!1S5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31c9370-381c-4f48-bc60-51b0dca3158f_2050x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: The <strong>content_list structure</strong>. [<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/README.md">Source</a>].</figcaption></figure></div><p><strong>Multimodal processing </strong>refers to handling non-textual content, which can&#8217;t simply be inserted into a RAG system as regular text without losing important semantic details. Hence, the parser identifies images, tables, and equations as distinct items. 
Later, dedicated modal processors utilize visual models or language models to generate descriptions, entities, and relationships from these non-textual components.</p><h2>PDF Parsing Workflow</h2><p>Figure 3 outlines the complete PDF parsing workflow used by RAG-Anything.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ghpf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ghpf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ghpf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ghpf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ghpf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ghpf!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png" width="1200" height="800.2747252747253" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1381480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/199555363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ghpf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ghpf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ghpf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ghpf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b0a98e9-a00f-4718-b4a1-b3465d2bec59_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: A multimodal document ingestion pipeline that parses PDFs, images, and Office files into text and modality-aware chunks for LightRAG. Image by the author.</figcaption></figure></div><p>When processing a PDF, <code>ProcessorMixin.parse_document()</code> first checks if the local file exists. It then generates a cache key (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/processor.py#L48">raganything/processor.py:48</a>) based on the file path, modification time, parser name, parse method, and certain parsing parameters. 
It then returns the cached parsing result if one exists and is still valid (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/processor.py#L239">raganything/processor.py:239</a>).</p><p>On a cache miss, the PDF is passed to the current parser&#8217;s <code>parse_pdf()</code> method. To avoid blocking the main asynchronous workflow, this potentially heavy parsing task runs in a separate thread via <code>asyncio.to_thread()</code> (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/processor.py#L470">raganything/processor.py:470</a>).</p><p>In the default PDF parsing flow (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L1078">raganything/parser.py:1078</a>), the parser creates an output directory named after a hash of the file path to prevent filename collisions during batch processing. The <code>_run_mineru_command()</code> function (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L714">raganything/parser.py:714</a>) then builds and executes the MinerU CLI command. MinerU (<a href="https://aiexpjourney.substack.com/p/from-big-picture-to-details-mineru">From Big Picture to Details: MinerU 2.5 Redefines Document Parsing &#8212; AI Innovations and Insights 77</a>) is designed to handle OCR, table and formula extraction, and multiple parsing backends. It accepts parameters such as <code>method</code>, <code>backend</code>, <code>device</code>, <code>start_page</code>, <code>end_page</code>, <code>formula</code>, <code>table</code>, and <code>source</code>.</p><p>After running MinerU, the project does not treat the command-line output as the final result. Instead, it reads the JSON and Markdown files that MinerU produces (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L960">raganything/parser.py:960</a>). The code accommodates variations in field names, such as <code>img_caption</code> versus <code>image_caption</code>. It also converts paths for images, tables, and equations into absolute paths and validates them to avoid security issues. 
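</p><p>The field-name and path handling described above can be sketched as follows. This is a simplified illustration rather than RAG-Anything&#8217;s actual code: the helper <code>normalize_item</code>, the alias table, the choice of <code>img_caption</code> as the canonical key, the <code>img_path</code> key, and the specific check (rejecting paths that resolve outside the output directory) are assumptions for this example.</p>

```python
from pathlib import Path
from typing import Any, Dict

# Hypothetical alias table: alternative field name -> canonical field name.
FIELD_ALIASES = {"image_caption": "img_caption"}


def normalize_item(item: Dict[str, Any], base_dir: Path) -> Dict[str, Any]:
    """Return a copy of one parsed item with canonical keys and a checked path."""
    # Rename any aliased keys; all other keys pass through unchanged.
    out = {FIELD_ALIASES.get(key, key): value for key, value in item.items()}

    rel = out.get("img_path")  # assumed key holding a path relative to base_dir
    if rel:
        base = base_dir.resolve()
        candidate = (base / rel).resolve()
        # Path.is_relative_to requires Python 3.9 or newer.
        if candidate.is_relative_to(base):
            out["img_path"] = str(candidate)  # store the absolute path
        else:
            out["img_path"] = ""  # drop paths that escape the output directory
    return out
```

<p>Centralizing this in one function means downstream code only ever sees one spelling of each field and only paths inside the expected directory.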
</p><p>Once parsing completes, the system generates a content-based <code>doc_id</code>. It then separates text and multimodal content (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/utils.py#L89">raganything/utils.py:89</a>). Textual data moves into LightRAG, while images, tables, and equations are handled by the multimodal processors (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/processor.py#L607">raganything/processor.py:607</a>).</p><h2>Why Three Parsers?</h2><p>If the goal were just extracting text from simple PDFs, a lightweight library would suffice. But RAG-Anything tackles more challenging scenarios: scanned documents, intricate layouts, tables, equations, images, and subsequent multimodal understanding.</p><p>This complexity drives the project&#8217;s choice to combine specialized external parsers with an internal standardization layer. External tools manage extraction, while RAG-Anything handles parser selection, caching, field compatibility, path normalization, content_list standardization, and integration with LightRAG (<a href="https://aiexpjourney.substack.com/p/ai-innovations-and-trends-03-lightrag">AI Innovations and Trends 03: LightRAG, Docling, DRIFT, and More</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!--g9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!--g9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!--g9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!--g9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!--g9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!--g9!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png" width="1200" height="800.2747252747253" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1572804,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/199555363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!--g9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!--g9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!--g9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!--g9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3febf16e-1b04-4291-9e60-f62c1568ab03_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: RAG-Anything orchestrates multiple parsing backends through a shared parser interface, mapping MinerU, Docling, and PaddleOCR outputs into a unified content_list for LightRAG indexing and multimodal chunk processing. Image by the author.</figcaption></figure></div><p>So RAG-Anything doesn&#8217;t limit PDF parsing to just one library. Instead, it introduces a unified parser abstraction alongside multiple implementations.</p><ul><li><p>The default approach uses <strong>MinerU 2.0</strong> (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L694">raganything/parser.py:694</a>) and is intended for extracting complex PDF structures. But this strength comes with costs: relying on external CLIs and models means dealing with environment setup, initial model downloads, and subprocess errors.</p></li><li><p>Another implementation leverages <strong>Docling&#8217;s</strong> Python API (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L1455">raganything/parser.py:1455</a>, <a href="https://aiexpjourney.substack.com/p/ai-innovations-and-trends-03-lightrag">AI Innovations and Trends 03: LightRAG, Docling, DRIFT, and More</a>). It caches the <code>DocumentConverter</code> through <code>_get_converter()</code> (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L1627">raganything/parser.py:1627</a>), which reduces redundant model loading. <code>DoclingParser</code> supports parsing both Office documents and HTML. 
However, even though <code>DoclingParser</code> itself includes <code>parse_html()</code>, the higher-level method <code>ProcessorMixin.parse_document()</code> currently doesn&#8217;t route HTML files through it, so HTML parsing is not fully integrated into the top-level flow.</p></li><li><p>The third option, <strong>PaddleOCRParser</strong> (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L2052">raganything/parser.py:2052</a>), functions more like a dedicated OCR path. PDFs are rendered into images page by page using <code>pypdfium2</code>, followed by OCR processing (<a href="https://github.com/HKUDS/RAG-Anything/blob/v1.3.1/raganything/parser.py#L2205">raganything/parser.py:2205</a>). The output is compatible with the <code>content_list</code> format, but structural semantics like tables, images, or equations aren&#8217;t preserved.</p></li></ul><p>The point of this design is that downstream processes depend only on a unified <code>content_list</code>, which makes parsers interchangeable based on specific needs.</p><p>The three implementations also differ operationally: MinerU shells out to a CLI, Docling uses an in-process Python API with converter caching, and PaddleOCR lazily imports optional OCR dependencies.</p><h2>Thoughts</h2><p>RAG-Anything&#8217;s PDF parsing can be summarized as follows: specialized parsers handle complex documents, the results feed into a standardized intermediate structure called <code>content_list</code>, and from there, textual and multimodal content flows separately into the RAG pipeline.</p><p><strong>But I have a concern.</strong> While <code>content_list</code> serves as the project&#8217;s stable boundary, it also represents a point where information may be lost. 
Details like the original PDF layout, bounding boxes, hierarchical structure, and confidence in reading order are not fully preserved.</p><p>Finally, if you want to explore other core mechanisms of RAG-Anything or have any thoughts about this post, feel free to leave a comment.</p><div><hr></div><p>Reference: <a href="https://github.com/HKUDS/RAG-Anything">https://github.com/HKUDS/RAG-Anything</a>.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/p/unlocking-pdfs-for-rag-how-rag-anything?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading AI Exploration Journey! Feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/p/unlocking-pdfs-for-rag-how-rag-anything?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://aiexpjourney.substack.com/p/unlocking-pdfs-for-rag-how-rag-anything?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p></p>]]></content:encoded></item><item><title><![CDATA[The Hidden Bottleneck in RAG Code Completion: How You Chunk the Code — AI Innovations and Insights 138]]></title><description><![CDATA[In a previous article, I discussed how Claude Code handles retrieval (The Secret Behind Claude Code&#8217;s Retrieval: Why Live Search Fits Better than RAG &#8212; AI Innovations and Insights 134), noting that it basically doesn&#8217;t incorporate a RAG process.]]></description><link>https://aiexpjourney.substack.com/p/the-hidden-bottleneck-in-rag-code</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/the-hidden-bottleneck-in-rag-code</guid><pubDate>Sat, 06 Jun 
2026 04:49:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DMKK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5065275a-ccc6-4112-8960-f10a83054cc3_2420x1572.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a previous article, I discussed how Claude Code handles retrieval (<a href="https://aiexpjourney.substack.com/p/the-secret-behind-claude-codes-retrieval">The Secret Behind Claude Code&#8217;s Retrieval: Why Live Search Fits Better than RAG &#8212; AI Innovations and Insights 134</a>), noting that it basically doesn&#8217;t incorporate a RAG process. </p><p>Today, let&#8217;s dive into the approach and findings related to using RAG for code completion.</p><h2>Motivation </h2><p>When using RAG for code completion, it&#8217;s common practice to first divide the code into chunks before retrieval. However, the question of exactly how to chunk the code hasn&#8217;t been systematically studied. </p><p>Most existing work has focused primarily on optimizing the retriever or generator, treating chunking merely as a default preprocessing step. 
But chunking determines the smallest unit of code available to the retriever. Poor chunking can degrade the quality of completions, no matter how advanced the retriever or the language model itself may be.</p><p>So here&#8217;s something worth thinking about: In retrieval-augmented code completion, how important is chunking really? Is structured chunking truly better than a simple sliding window approach? And when balancing chunk size, context length, cost, and effectiveness, what&#8217;s the best trade-off?</p><h2>Retrieval-Augmented Code Completion: Key Technologies</h2><p>Before jumping into the findings, let&#8217;s first unpack the key technologies involved.</p><h4>Chunking: How to Divide Code for Retrieval</h4><p>Chunking means breaking source code files into small, retrievable pieces that the retriever can work with later. There are four representative strategies:</p><ul><li><p><strong>Function Chunking</strong>: Extracts each top-level function or method, keeping its signature and body together when possible; oversized functions are split along direct child-statement boundaries.</p></li><li><p><strong>Declaration Chunking</strong>: Retains class headers, field definitions, method signatures, and function-level declarations while omitting method bodies; in the Python setup, private underscore-prefixed functions are filtered out.</p></li><li><p><strong>Sliding Window</strong>: Splits files into fixed-size, line-aligned windows with configurable line overlap, without using syntax structure. This method is simple and language-agnostic, but it risks splitting functions at awkward points. </p></li><li><p><strong>cAST</strong>: Operates over all Abstract Syntax Tree (AST) node types, recursively splitting large nodes and greedily merging adjacent siblings to produce more uniform chunks.</p></li></ul>
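<p>To make the simplest of these strategies concrete, here is a minimal sketch of sliding-window chunking, assuming line-aligned windows with a fixed line overlap as described above. The names <code>Chunk</code> and <code>sliding_window_chunks</code>, and the default sizes, are hypothetical choices for illustration; this is not the implementation used in the study.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One retrievable piece of a source file (hypothetical structure)."""
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def sliding_window_chunks(source: str, window: int = 20, overlap: int = 5) -> List[Chunk]:
    """Split source into fixed-size line windows that share `overlap` lines.

    No syntax information is used, so a window may cut through a function.
    """
    # range(window) is empty when window is not positive, so this single
    # check rejects bad windows and overlaps outside [0, window - 1].
    if overlap not in range(window):
        raise ValueError("overlap must be non-negative and smaller than window")

    lines = source.splitlines()
    step = window - overlap  # how far each new window advances
    chunks: List[Chunk] = []
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        chunks.append(Chunk(start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reaches the end of the file
    return chunks
```

<p>With a 10-line file, <code>window=4</code>, and <code>overlap=1</code>, this sketch yields windows covering lines 1 to 4, 4 to 7, and 7 to 10. The boundaries depend only on line counts, which is exactly why the approach is language-agnostic and also why it can cut a function in half.</p>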
      <p>
          <a href="https://aiexpjourney.substack.com/p/the-hidden-bottleneck-in-rag-code">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Building Better LLM Agents with Smarter Memory Architectures — AI Innovations and Insights 137]]></title><description><![CDATA[If you&#8217;ve ever used an AI assistant for a long project, you&#8217;ve probably felt the frustration: it is brilliant one moment and forgetful the next.]]></description><link>https://aiexpjourney.substack.com/p/building-better-llm-agents-with-smarter</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/building-better-llm-agents-with-smarter</guid><pubDate>Tue, 02 Jun 2026 02:07:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T329!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever used an AI assistant for a long project, you&#8217;ve probably felt the frustration: it is brilliant one moment and forgetful the next.</p><p>That gap between short bursts of intelligence and sustained understanding is exactly where memory becomes impossible to ignore.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Background </h2><p>LLMs fundamentally operate by responding based on current context. Conversation history, tool invocation, or task execution depends essentially on the information provided in the prompt.</p><p>However, real-world agents typically aren&#8217;t limited to answering single, standalone questions; they need to handle longer-term tasks involving multi-turn dialogues, personal assistance, software engineering projects, gaming, scientific discovery, and more. For these complex, long-duration tasks, memory becomes a critical module for LLM-based agents. Memory allows agents to accumulate knowledge, reuse past experiences, and make more coherent decisions throughout ongoing tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7n6x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7n6x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 424w, https://substackcdn.com/image/fetch/$s_!7n6x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 
848w, https://substackcdn.com/image/fetch/$s_!7n6x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 1272w, https://substackcdn.com/image/fetch/$s_!7n6x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7n6x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png" width="593" height="423.57142857142856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:840,&quot;resizeWidth&quot;:593,&quot;bytes&quot;:202736,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196862816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7n6x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!7n6x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 848w, https://substackcdn.com/image/fetch/$s_!7n6x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 1272w, https://substackcdn.com/image/fetch/$s_!7n6x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1029b8b2-1169-4fcb-81a2-7e34a9a5ddda_840x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Overview of naive long-context prompting and memory-augmented prompting. [<a href="https://arxiv.org/pdf/2604.01707v2">Source</a>].</figcaption></figure></div><p>As illustrated in Figure 1, the simplest method to manage context is &#8220;packing large portions of history directly into the prompt,&#8221; known as <strong>naive long-context prompting</strong>. But this approach quickly encounters practical issues: context windows overflow, token consumption increases significantly, latency grows, and the model might struggle to reliably pinpoint crucial information in overly lengthy contexts. </p><p>In contrast, <strong>memory-augmented prompting</strong> stores interaction history in an external memory system, retrieving the relevant bits as needed. When well designed, this can reduce unnecessary tokens and help the model focus more effectively on truly valuable context. Here, &#8220;memory&#8221; isn&#8217;t some mysterious cognitive process like human memory; rather, it&#8217;s an external mechanism designed to process, store, and update historical interactions, retrieving useful information when answering new questions.</p><p>This distinction clarifies why agent memory and Retrieval-Augmented Generation (RAG) are related but distinct concepts. RAG primarily retrieves factual evidence from external documents or knowledge bases, supplementing domain-specific information and reducing hallucinations. Agent memory, however, specifically stores information relevant to the current user, ongoing conversations, and long-term interactions, such as user preferences, key events, historical decisions, and task constraints. 
While the two approaches can complement each other, their focuses are distinct.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xrlu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xrlu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 424w, https://substackcdn.com/image/fetch/$s_!xrlu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 848w, https://substackcdn.com/image/fetch/$s_!xrlu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 1272w, https://substackcdn.com/image/fetch/$s_!xrlu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xrlu!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png" width="1200" height="300.8241758241758" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:365,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:191724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196862816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xrlu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 424w, https://substackcdn.com/image/fetch/$s_!xrlu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 848w, https://substackcdn.com/image/fetch/$s_!xrlu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 1272w, https://substackcdn.com/image/fetch/$s_!xrlu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50b6c7e2-2d3c-4944-adae-3d5f03dfa13c_2112x530.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Classification of representative agent memory methods. [<a href="https://arxiv.org/pdf/2604.01707v2">Source</a>].</figcaption></figure></div><p>Over the years, numerous agent memory methods have emerged, including A-MEM, MemoryBank, MemGPT, Mem0, Mem0&#119892;, MemoChat, Zep, MemTree, MemoryOS, and MemOS. </p><p>However, the literature points out that the problem isn&#8217;t a lack of methods but rather the absence of a unified framework for comparison. Many existing methods report overall performance but rarely dissect the contributions of individual modules. Moreover, systematic comparisons focusing on accuracy and efficiency across methods remain insufficient. 
</p><p>Figure 2 starts to make these differences explicit by categorizing representative methods along four dimensions: information extraction, management operations, storage structure, and retrieval mechanism.</p><h2>Core Issues with Existing Methods</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T329!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T329!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 424w, https://substackcdn.com/image/fetch/$s_!T329!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 848w, https://substackcdn.com/image/fetch/$s_!T329!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 1272w, https://substackcdn.com/image/fetch/$s_!T329!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T329!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png" width="1200" height="749.1758241758242" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:909,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:4861513,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196862816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T329!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 424w, https://substackcdn.com/image/fetch/$s_!T329!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 848w, https://substackcdn.com/image/fetch/$s_!T329!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 1272w, https://substackcdn.com/image/fetch/$s_!T329!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F666322ba-48b9-409d-87dc-5843efb110a2_2536x1584.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: Core challenges in agent memory design: long-context prompting is costly and unreliable, raw memories require active management, and diverse memory architectures need a unified framework for fair comparison. <strong>Image by author.</strong></figcaption></figure></div><p>Current approaches generally struggle in three main areas.</p><ul><li><p>First, directly placing extensive historical interactions into prompts quickly becomes expensive and unreliable. Context windows get overloaded, resulting in high token usage and making it tough for models to reliably extract meaningful information.</p></li><li><p>Second, memory systems aren&#8217;t as simple as saving raw text snippets. 
Effective memory must also handle redundant, conflicting, outdated, or structurally related information. Raw dialogue can be valuable, but it needs to be paired with management operations, such as consolidation, updating, filtering, and linking, so the system can preserve useful detail without drowning retrieval in noise.</p></li><li><p>Third, existing methods differ significantly in their extraction, storage, management, and retrieval strategies. Without a unified framework, it is hard to compare the trade-offs among tree structures, graph-based memories, summarization strategies, and retrieval mechanisms under the same experimental setting.</p></li></ul><p>This leads to an essential question: for tasks involving long-term dialogues and multi-session interactions, how can a memory mechanism for LLM-based agents be systematically designed, compared, and evaluated? Such a system needs to reliably store, update, and retrieve historical information within limited contexts and token budgets.</p><h2>Agent Memory: Core Concept</h2><p>Here&#8217;s a simple way to visualize how an agent&#8217;s memory system works: think of it as an organized archive room that sorts itself automatically. Every time a user sends a new message, this archive room first decides which information is worth keeping. Next, it considers whether to merge this new information with existing files, update outdated records, or remove duplicates. Then, it files the information into the most appropriate cabinet, such as short-term storage, long-term archives, a vector database, or a graph-based store. Later, when the user asks a question, the system fetches relevant information from these cabinets and hands it to the LLM.</p>
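<p>To make the archive-room picture concrete, here is a minimal Python sketch of the four stages: extract, manage, store, and retrieve. Everything in it is an assumption made for illustration: the <code>MemoryStore</code> class, the "key: value" extraction rule, and the keyword-based retrieval are hypothetical stand-ins, not the design of any method named above. A real system would typically use an LLM for extraction and a vector or graph store for retrieval.</p>

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy archive room (hypothetical): one record per key, newest value wins."""

    records: dict = field(default_factory=dict)

    def extract(self, message: str):
        # Assumed rule: keep only "key: value" statements, ignore chit-chat.
        if ":" not in message:
            return None
        key, value = message.split(":", 1)
        return key.strip().lower(), value.strip()

    def write(self, message: str) -> bool:
        fact = self.extract(message)
        if fact is None:
            return False  # nothing worth keeping
        key, value = fact
        if self.records.get(key) == value:
            return False  # exact duplicate, skip it
        self.records[key] = value  # file a new record or update an outdated one
        return True

    def retrieve(self, question: str) -> list:
        # Naive retrieval: return records whose key appears in the question.
        q = question.lower()
        return [f"{k}: {v}" for k, v in self.records.items() if k in q]


store = MemoryStore()
store.write("Hello there!")          # ignored: no fact to extract
store.write("favorite city: Paris")  # stored
store.write("favorite city: Paris")  # ignored: duplicate
store.write("favorite city: Kyoto")  # replaces the outdated record
context = store.retrieve("What is my favorite city?")
```

<p>Here <code>context</code> holds only the updated record, which is what would be handed to the LLM alongside the question.</p>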
      <p>
          <a href="https://aiexpjourney.substack.com/p/building-better-llm-agents-with-smarter">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Can We Build an Atlas for AI Research? — AI Innovations and Insights 136]]></title><description><![CDATA[Every researcher knows the strange feeling of reading a paper and realizing there are ten earlier papers you probably should have read first.]]></description><link>https://aiexpjourney.substack.com/p/can-we-build-an-atlas-for-ai-research</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/can-we-build-an-atlas-for-ai-research</guid><pubDate>Fri, 29 May 2026 02:13:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GbYg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcdda19-7f43-4046-86e1-a720828d5c10_2536x1616.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every researcher knows the strange feeling of reading a paper and realizing there are ten earlier papers you probably should have read first.</p><p>The hard part is not finding more papers, but understanding how the ideas actually grew into what we see today.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why AI Research Still Relies on Human Intuition</h2><p>Existing academic tools like Google Scholar, Semantic Scholar, and OpenAlex are largely centered around papers themselves. They can show which paper cites which, but they rarely capture how one method evolves from another, what bottlenecks it addresses, or which new directions it opens. Understanding these &#8220;method evolution relationships&#8221; still largely relies on researchers reading papers and mentally mapping the connections.</p><p>Now, AI research agents are starting to participate in the scientific process. Unlike humans, they struggle to reliably reconstruct the evolution of methods from vast amounts of unstructured papers. This limitation hinders their ability to grasp research trajectories, spot gaps, and evaluate or generate new ideas.</p><p>What if there were a better approach? Imagine creating an explicit, queryable graph that captures methods, the evolutionary links between them, and evidence showing how specific bottlenecks were overcome. Such a graph could serve as the foundation for AI research agents, giving them a structured understanding of the landscape instead of forcing them to piece it together from raw text.</p>
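<p>As a rough illustration of what "explicit and queryable" could mean, the Python sketch below stores methods as nodes and evolutionary links as edges annotated with the bottleneck the newer method addresses. The method names, bottleneck labels, and the <code>descendants</code> helper are all placeholders invented for this example; they are not taken from any paper or from the system discussed in this post.</p>

```python
from collections import defaultdict

# Each edge: (earlier method, later method, bottleneck the later one addresses).
# All entries are illustrative placeholders, not extracted from real papers.
EDGES = [
    ("method_a", "method_b", "slow training"),
    ("method_b", "method_c", "limited context length"),
    ("method_a", "method_d", "high memory use"),
]

successors = defaultdict(list)
for src, dst, bottleneck in EDGES:
    successors[src].append((dst, bottleneck))


def descendants(method):
    """Return every (method, bottleneck) pair reachable from `method`."""
    found, stack, seen = [], [method], {method}
    while stack:
        current = stack.pop()
        for nxt, bottleneck in successors[current]:
            if nxt not in seen:
                seen.add(nxt)
                found.append((nxt, bottleneck))
                stack.append(nxt)
    return found


result = descendants("method_a")
```

<p>An agent could answer "what grew out of method_a, and why?" with a single traversal like this instead of re-reading every paper.</p>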
      <p>
          <a href="https://aiexpjourney.substack.com/p/can-we-build-an-atlas-for-ai-research">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[DocSeeker: Page by Page, Evidence by Evidence — AI Innovations and Insights 135]]></title><description><![CDATA[If you have ever asked a model a question about a long report and received a confident but unverifiable answer, you know the frustration.]]></description><link>https://aiexpjourney.substack.com/p/docseeker-page-by-page-evidence-by</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/docseeker-page-by-page-evidence-by</guid><pubDate>Mon, 25 May 2026 02:37:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lUWm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have ever asked a model a question about a long report and received a confident but unverifiable answer, you know the frustration.</p><p><strong>Long-document understanding</strong> is not just about reading more pages, it is about not losing trust along the way.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e4X2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e4X2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 424w, https://substackcdn.com/image/fetch/$s_!e4X2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 848w, 
https://substackcdn.com/image/fetch/$s_!e4X2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 1272w, https://substackcdn.com/image/fetch/$s_!e4X2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e4X2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png" width="1456" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337815,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196637105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!e4X2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 424w, 
https://substackcdn.com/image/fetch/$s_!e4X2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 848w, https://substackcdn.com/image/fetch/$s_!e4X2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 1272w, https://substackcdn.com/image/fetch/$s_!e4X2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F969a0133-f177-4e23-8a33-78d9d09fdc6a_1578x780.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Overview of Principal Experimental Results in Long Document Understanding (Left) and the ALR Reasoning Paradigm of DocSeeker (Right). [<a href="https://arxiv.org/pdf/2604.12812v4">Source</a>].</figcaption></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h2>Why Long Documents Break Multimodal Models</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lUWm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lUWm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 424w, 
https://substackcdn.com/image/fetch/$s_!lUWm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 848w, https://substackcdn.com/image/fetch/$s_!lUWm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!lUWm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lUWm!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png" width="1200" height="793.6813186813187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:963,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:4538585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196637105?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lUWm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 424w, https://substackcdn.com/image/fetch/$s_!lUWm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 848w, https://substackcdn.com/image/fetch/$s_!lUWm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!lUWm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c440d7-5794-4afe-9e28-1aaf0312f9c8_2378x1572.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Why long-document QA remains difficult. Existing systems typically follow one of three routes: OCR/parsing-based extraction, visual RAG over top-K retrieved pages, or end-to-end MLLM processing of all page images. Each route has a different failure mode: OCR can lose visual/layout information, RAG must balance missing evidence against adding noise, and end-to-end MLLMs face low signal-to-noise ratios as key evidence is buried in long visual contexts. A broader bottleneck is supervision: most datasets provide only final short answers, leaving evidence localization and reasoning weakly supervised. Image by the author.</figcaption></figure></div><p>Currently, existing approaches for long-document question answering generally fall into three categories:</p><ul><li><p><strong>OCR/Parsing-based Methods:</strong> These methods first extract text and layout using OCR or PDF parsers and then analyze the extracted information using LLMs. Examples include LongFormer, LayTokenLLM, HiVT5, DocLLM, and DocVLM. The main challenges are sensitivity to OCR errors and the potential loss of important layout and visual details.</p></li><li><p><strong>Visual RAG Methods:</strong> These methods initially use visual retrieval to find the top-K most relevant pages before passing them to multimodal large language models (MLLMs) for answering. 
Representative retrievers include ColPali and DSE, while RAG systems include VisRAG (<a href="https://aiexpjourney.substack.com/p/ai-innovations-and-trends-02-visrag">AI Innovations and Trends 02: VisRAG, GraphRAG, RAGLAB, and More</a>), SV-RAG, VDocRAG, and M3DocRAG (<a href="https://aiexpjourney.substack.com/p/multimodal-rag-unveiled-a-deep-dive">MultiModal RAG Unveiled: A Deep Dive into Cutting-Edge Advancements</a>). The primary issue here is determining the optimal number of pages (top-K): too few pages can miss key evidence, while too many can introduce unnecessary noise.</p></li><li><p><strong>End-to-End MLLM Methods:</strong> These methods directly input multi-page document images into multimodal large language models for processing. Examples of this approach are mPLUG-DocOwl2 (<a href="https://aiexpjourney.substack.com/p/let-ai-instantly-parse-heavy-documents">Let AI Instantly Parse Heavy Documents: The Magic of MPLUG-DOCOWL2&#8217;s Efficient Compression</a>), Qwen2.5-VL, and InternVL3. However, handling the large number of visual tokens in lengthy documents leads to a severely <strong>low Signal-to-Noise Ratio (SNR)</strong>, where crucial evidence is buried within vast irrelevant content, negatively affecting both efficiency and overall performance.</p></li></ul><p>Beyond architectural limits, most existing datasets only provide final short answers. Training only on short-answer labels can encourage models to learn brittle shortcuts or dataset-specific patterns, rather than robust, interpretable evidence localization and reasoning. </p><h2>DocSeeker: Three-step ALR Visual Reasoning</h2><p>DocSeeker is built around a simple idea: break long-document question answering into three clear steps. First analyze the question, then locate relevant evidence, and finally reason based on that evidence. The goal is to make evidence localization and reasoning explicit, giving users page-level traces that can be checked.</p>
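<p>The three-step flow can be sketched in a few lines of Python. This is only a toy under loud assumptions: it is not DocSeeker&#8217;s implementation, it swaps the multimodal model for plain keyword matching over page text, and the function names <code>analyze</code>, <code>locate</code>, and <code>reason</code> are hypothetical. What it does show is the shape of the output: an answer trace that names the pages it relied on.</p>

```python
def analyze(question):
    # Step 1 (Analyze): reduce the question to the terms worth searching for.
    stop = {"what", "is", "the", "in", "of", "a", "was"}
    return [w for w in question.lower().strip("?").split() if w not in stop]


def locate(terms, pages, top_k=2):
    # Step 2 (Locate): rank pages by how many query terms they contain.
    scored = []
    for number, text in pages.items():
        hits = sum(term in text.lower() for term in terms)
        if hits:
            scored.append((hits, number))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [number for _, number in scored[:top_k]]


def reason(question, pages, evidence_pages):
    # Step 3 (Reason): a real system would call an MLLM here. This stub only
    # returns the evidence so a reader can check the page-level trace.
    return {
        "question": question,
        "evidence_pages": evidence_pages,
        "evidence": [pages[n] for n in evidence_pages],
    }


# Illustrative document: page number mapped to made-up page text.
pages = {
    1: "Annual report cover page",
    2: "Revenue in 2023 was 4.2 million",
    3: "Staff picnic photos",
}
question = "What was the revenue in 2023?"
trace = reason(question, pages, locate(analyze(question), pages))
```

<p>Because <code>locate</code> drops pages with no matching terms before applying <code>top_k</code>, irrelevant pages never reach the reasoning step, which is the signal-to-noise benefit the explicit localization stage is meant to provide.</p>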
      <p>
          <a href="https://aiexpjourney.substack.com/p/docseeker-page-by-page-evidence-by">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Secret Behind Claude Code's Retrieval: Why Live Search Fits Better than RAG — AI Innovations and Insights 134]]></title><description><![CDATA[In AI coding tools, finding code isn&#8217;t just about a search box.]]></description><link>https://aiexpjourney.substack.com/p/the-secret-behind-claude-codes-retrieval</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/the-secret-behind-claude-codes-retrieval</guid><dc:creator><![CDATA[Florian]]></dc:creator><pubDate>Thu, 21 May 2026 08:55:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!43bX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In AI coding tools, finding code isn&#8217;t just about a search box. The model needs to navigate a real, ever-changing local codebase to locate relevant snippets. It cannot read the whole repository into context. It needs to respect file-read permissions, avoid flooding the model with irrelevant matches, and return evidence that can be verified by reading actual source files.</p><p>As a standout among coding tools, <strong>Claude Code&#8217;s approach to code retrieval grabbed my attention.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>After reading its <a href="https://github.com/tanbiralam/claude-code">leaked source code</a>, I found that Claude Code tackles this with a clear goal: enabling the model to explore codebases like an engineer. It starts by locating relevant files, narrows down content, reads crucial snippets, and occasionally leverages the Language Server Protocol (LSP) to understand definitions and references.</p><p>The notable design choice is that Claude Code does not appear to maintain a persistent full-text index or a vector database for code content by default. Instead, it relies heavily on live ripgrep searches, structured tool calls, and precise file reads.</p><h3>Quick Background: Terms You Need to Know</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!43bX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!43bX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 424w, 
https://substackcdn.com/image/fetch/$s_!43bX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 848w, https://substackcdn.com/image/fetch/$s_!43bX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 1272w, https://substackcdn.com/image/fetch/$s_!43bX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!43bX!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png" width="1200" height="768.9560439560439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!43bX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 424w, 
https://substackcdn.com/image/fetch/$s_!43bX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 848w, https://substackcdn.com/image/fetch/$s_!43bX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 1272w, https://substackcdn.com/image/fetch/$s_!43bX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e77e7b-564c-4231-8f5e-56dc9135aaaa_2400x1538.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Code retrieval architecture of Claude Code: a typical path from broad search to precise reading, with ripgrep powering Glob/Grep, optional LSP intelligence, and @file path indexing as a side channel. Image by the author.</figcaption></figure></div><p><strong>Glob</strong> patterns are rules for matching file paths. For example, <code>src/**/*.ts</code> means &#8220;all TypeScript files under src.&#8221; In Claude Code, <code>GlobTool</code> narrows the search space by finding candidate paths. Its implementation lives in <code>src/tools/GlobTool/GlobTool.ts</code>.</p><p><strong>Grep</strong> handles content searches. It answers the question: &#8220;Which files contain lines matching this pattern?&#8221; Claude Code&#8217;s <code>GrepTool</code> supports three output modes: <code>content</code>, <code>files_with_matches</code>, and <code>count</code>. You can find its implementation in <code>src/tools/GrepTool/GrepTool.ts</code>.</p><p><strong>ripgrep</strong>, usually invoked as <code>rg</code>, is a fast text search tool. It&#8217;s great at searching large numbers of files using regular expressions. Claude Code wraps it in <code>src/utils/ripgrep.ts</code> rather than reimplementing file scanning itself.</p><p><strong>Read</strong> means precise reading. Search tools such as Grep and Glob indicate &#8220;where something might be relevant,&#8221; but actually understanding the code requires reading exact file ranges. Claude Code handles that through <code>FileReadTool</code>.</p><p><strong>LSP</strong>, or Language Server Protocol, <a href="https://github.com/tanbiralam/claude-code/blob/9f51e71641c1472293ebe4a7968b44455c00f0ba/src/tools/LSPTool/prompt.ts#L5">provides IDE-like capabilities</a> such as go-to-definition and find-references. It understands language structure better than plain text search.
Claude Code&#8217;s <code>LSPTool</code> is useful when a language server is available, but it is not the baseline retrieval mechanism.</p><p><strong>Indexing</strong> refers to a pre-built structure for quick lookups. Claude Code does have an index, but it primarily supports <code>@file</code> path auto-completion rather than full-text content search. The corresponding implementation is in <code>src/native-ts/file-index/index.ts:43</code>.</p><h3>Where This Fits in the Tool System</h3><p>Claude Code exposes search as structured tools, not as arbitrary shell text.</p><p><strong>At a high level</strong>, the execution path looks like this: Model decides it needs code -&gt; emits a tool call -&gt; toolExecution validates the input schema -&gt; permissions are checked -&gt; GrepTool / GlobTool / Read / LSP runs -&gt; results are bounded and returned -&gt; model decides the next step.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hszG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hszG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 424w, https://substackcdn.com/image/fetch/$s_!hszG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 848w, 
https://substackcdn.com/image/fetch/$s_!hszG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 1272w, https://substackcdn.com/image/fetch/$s_!hszG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hszG!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png" width="1200" height="627.1978021978022" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:761,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hszG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 424w, https://substackcdn.com/image/fetch/$s_!hszG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 848w, 
https://substackcdn.com/image/fetch/$s_!hszG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 1272w, https://substackcdn.com/image/fetch/$s_!hszG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d12de-a015-4bc4-8d4b-089faa6ce6a2_2400x1255.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: Claude Code exposes code retrieval as structured tools: model-driven calls pass through schema validation,
permission checks, tool execution, and shaped results before the model decides the next step. Image by the author.</figcaption></figure></div><p>The shared <code>Tool</code> abstraction is defined in <code>src/Tool.ts</code>. It gives each tool a schema, validation logic, permission checks, execution behavior, and metadata such as whether the tool is read-only or safe to run concurrently.</p><p>This matters because code retrieval is not only about locating text. It also has to be controlled. A dedicated <code>GrepTool</code> can enforce path rules, output limits, permission checks, and UI grouping in a way that raw Bash commands cannot.</p><h3>Core Flow: How a Single Code Search Works</h3><p>When the model needs to find a specific implementation, it might first invoke <code>GrepTool</code> to search for keywords.</p><p><code>GrepTool</code> begins by validating inputs like <code>pattern</code>, <code>path</code>, <code>glob</code>, <code>type</code>, and <code>output_mode</code> against a schema.
Then, it verifies that the target path exists and checks read permissions through filesystem permission logic in <code>src/utils/permissions/filesystem.ts</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PVyT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PVyT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!PVyT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!PVyT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!PVyT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PVyT!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png" width="1200" height="675" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PVyT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 424w, https://substackcdn.com/image/fetch/$s_!PVyT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 848w, https://substackcdn.com/image/fetch/$s_!PVyT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 1272w, https://substackcdn.com/image/fetch/$s_!PVyT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58454cad-975d-4d18-a4d6-098ad9748ae2_1672x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 3: A visual walkthrough of how a model turns code-search intent into validated tool calls, ripgrep execution, optional glob discovery, and evidence-backed file reads. Image by the author.</figcaption></figure></div><p>After validation, <code>GrepTool</code> translates user inputs into parameters for <code>ripgrep</code>.
Several details matter here:</p><ul><li><p>version-control directories are excluded from normal content search;</p></li><li><p>glob and file-type filters can narrow the search;</p></li><li><p>context lines can be requested when needed;</p></li><li><p>very long lines are capped with a max-column setting to avoid flooding the model context;</p></li><li><p>deny rules can be converted into ignore patterns so blocked files do not appear in search results.</p></li></ul><p>The <code>ripgrep</code> wrapper in <code>src/utils/ripgrep.ts</code> chooses between available <code>ripgrep</code> implementations and handles process-level concerns such as timeouts, cancellation, buffering, partial output, and exit codes. For example, ripgrep&#8217;s exit code <code>1</code> means &#8220;no match found,&#8221; which Claude Code treats as an empty result rather than an error.</p><p>If the model knows the filename shape but lacks specific content keywords, it might first use <code>GlobTool</code>. Under the hood, <code>GlobTool</code> enumerates matching paths using <code>rg --files --glob</code>, implemented in <code>src/utils/glob.ts</code>.</p><p>Once candidates are found, the model calls <code>Read</code> to inspect exact ranges. This keeps the retrieval loop evidence-driven: search results identify possible locations, while file reads provide the source code the model actually reasons over.</p><h3>Path Completion: Another Easily Confused Search Mechanism</h3><p>Claude Code also includes a <code>FileIndex</code>, but it&#8217;s not designed for content searching.</p><p>Instead, <code>FileIndex</code> supports path autocompletion when using the <code>@file</code> command in the user interface. You can find the related logic inside <code>src/hooks/fileSuggestions.ts</code> and <code>src/native-ts/file-index/index.ts</code>.</p><p>The file suggestion path works differently from content search. It gathers project paths, preferring <code>git ls-files</code> when available and falling back to <code>rg --files</code>.
It then builds an in-memory fuzzy index over paths, using structures such as <code>paths</code>, <code>lowerPaths</code>, <code>charBits</code>, and <code>pathLens</code>.</p><p>This design is intentional: users expect immediate responses when typing <code>@xxx</code>, making it worthwhile to maintain a lightweight, in-memory index for path completion.</p><p>For content search, Claude Code uses live <code>ripgrep</code> calls through <code>GrepTool</code>, not a persistent content index. A reasonable inference is that this helps keep results aligned with the current filesystem and current read-permission rules, though the source does not explicitly state that as the design motivation.</p><h3>Key Insight: Why Not Use Retrieval-Augmented Generation (RAG)?</h3><p><strong>The main reason</strong> is that the core problem isn&#8217;t about finding &#8220;semantically similar&#8221; snippets. Instead, it&#8217;s about precisely, instantly, and securely finding evidence in the current local codebase.</p><ul><li><p><strong>A vector-index-based RAG involves this workflow: </strong>Chunking code &#8594; generating embeddings &#8594; storing in vector DB &#8594; similarity search &#8594; feeding results to the LLM.</p></li><li><p><strong>Claude Code uses a very different approach: </strong>Glob finds files &#8594; Grep precisely searches content &#8594; Read extracts snippets &#8594; optionally, LSP queries symbols.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mKpC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!mKpC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 424w, https://substackcdn.com/image/fetch/$s_!mKpC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 848w, https://substackcdn.com/image/fetch/$s_!mKpC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!mKpC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mKpC!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png" width="1200" height="633.7912087912088" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!mKpC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 424w, https://substackcdn.com/image/fetch/$s_!mKpC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 848w, https://substackcdn.com/image/fetch/$s_!mKpC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!mKpC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3564ed0-7618-41c7-b34c-1c1d94bc84d2_2400x1268.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 4: Claude Code favors live, exact, permission-aware search over vector RAG for finding verifiable evidence in local codebases. Image by the author.</figcaption></figure></div><p>Here&#8217;s why default <strong>vector-index-based RAG isn&#8217;t ideal</strong>:</p><ol><li><p><strong>Dynamic codebase: </strong>Files can be created, edited, renamed, or deleted during a session. A vector index would need invalidation, rechunking, re-embedding, and deletion logic. From the source code alone, we cannot say Claude Code rejected RAG for this reason, but the current live-search design avoids that entire class of synchronization problems.</p></li><li><p><strong>Exact matching is critical: </strong>A model may need to find <code>MACRO.VERSION</code>, <code>checkReadPermissionForTool</code>, an error string, an environment variable, a CLI flag, or a route path. These are not fuzzy semantic questions. They are often exact-match or regex-match questions, where <code>GrepTool</code> is a better first tool than embedding similarity.</p></li><li><p><strong>Simpler permission handling: </strong>Claude Code&#8217;s search path checks read permissions and can apply deny rules before returning results. With a vector index, restricted content must either never be indexed or must be reliably filtered at query time, including after permission changes. That is possible, but it is more complex and riskier than applying current deny rules to live search.</p></li><li><p><strong>Verifiable results: </strong>Vector retrieval is less directly verifiable.
A similarity result may be useful, but the reason it ranked highly is not always obvious. Grep gives concrete file paths and matching lines. Read then fetches the actual source. That makes the model&#8217;s next step easier to audit.</p></li></ol><p>So the <strong>better summary</strong> is not &#8220;RAG is bad for code.&#8221; It is: Claude Code&#8217;s default retrieval path optimizes for live, exact, permission-aware evidence. Vector RAG optimizes for semantic recall over preprocessed corpora. <strong>Those are different retrieval problems.</strong></p><h3>What&#8217;s Truly Worth Learning Here?</h3><p><strong>From reading the leaked code, in my view</strong>, rather than building one large retrieval system to handle every query, Claude Code splits the code retrieval problem into smaller mechanisms:</p><pre><code>file discovery      -&gt; GlobTool
content search      -&gt; GrepTool
source inspection   -&gt; Read
symbol navigation   -&gt; LSPTool
path completion     -&gt; FileIndex</code></pre><p>For an AI coding agent, that is a practical design. The model does not need a perfect global understanding of the repository upfront. It needs a reliable way to gather evidence, test hypotheses, and refine its search until it reaches the code that matters.</p><p><strong>But I have a concern about this algorithm.</strong></p><p>Repeated broad searches may rescan the filesystem. Search results do not have the kind of semantic ranking a vector system could provide. Grep cannot find conceptually related code unless there is a matching token or pattern. LSP can fill some of that gap, but only when available and applicable. <strong>This means Claude Code&#8217;s approach is</strong> <strong>strongest when</strong> the task can be grounded in names, strings, paths, symbols, and concrete code evidence. It is <strong>weaker when</strong> the user asks a vague conceptual question and no obvious search terms exist.</p>
]]></content:encoded></item><item><title><![CDATA[Corpus2Skill: Turning Flat RAG into Browsable Agent Skill Trees — AI Innovations and Insights 133 ]]></title><description><![CDATA[Agent Skills have become a popular technique recently (AI Exploration Journey: Skills).]]></description><link>https://aiexpjourney.substack.com/p/corpus2skill-turning-flat-rag-into</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/corpus2skill-turning-flat-rag-into</guid><pubDate>Sun, 17 May 2026 14:10:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b4sD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d3dfc2-2278-4a8f-8a65-f21d46b0d188_1704x914.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Agent Skills have become a popular technique recently (<a href="https://aiexpjourney.substack.com/t/skills">AI Exploration Journey: Skills</a>). </p><p>There are also emerging approaches that combine Skills with RAG, and this article takes a look at a representative one.</p>
<h2>The Missing Layer in RAG: Navigability </h2><p>Traditional RAG systems simply retrieve a few similar text passages for the model. </p><p>But when dealing with enterprise knowledge bases, the challenge is different. Questions often require the model to understand the structure of the knowledge base, know where to look, backtrack when necessary, and combine evidence across topics. </p><p>As a result, the focus is shifting to making knowledge bases not just searchable, but truly navigable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F5Xg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F5Xg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 424w, https://substackcdn.com/image/fetch/$s_!F5Xg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 848w,
https://substackcdn.com/image/fetch/$s_!F5Xg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 1272w, https://substackcdn.com/image/fetch/$s_!F5Xg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F5Xg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png" width="578" height="548.2897196261682" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:856,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:148195,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196404102?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F5Xg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 424w, 
https://substackcdn.com/image/fetch/$s_!F5Xg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 848w, https://substackcdn.com/image/fetch/$s_!F5Xg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 1272w, https://substackcdn.com/image/fetch/$s_!F5Xg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F592e42f3-32f4-494f-b2ca-d5cd499e6a9e_856x812.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Retrieve vs. Navigate. Traditional RAG passively feeds fixed passages to the LLM. CORPUS2SKILL distills the corpus into a navigable skill hierarchy that the agent actively explores, backtracks, and drills into to locate evidence. [<a href="https://arxiv.org/pdf/2604.14572v2">Source</a>].</figcaption></figure></div><p>Here&#8217;s what that really means:</p><ol><li><p><strong>Standard RAG is too passive: </strong>The model only sees the top-k results and has no sense of what other topics exist in the knowledge base, whether critical documents are missing, or if the current evidence is complete.</p></li><li><p><strong>Complex enterprise questions span multiple topics: </strong>A single Wix support question, for example, might touch on account types, payments, legal entities, and verification policies all at once. Simple similarity-based retrieval can return superficially related articles that don&#8217;t actually address the key issues.</p></li><li><p><strong>Agentic RAG can search iteratively but lacks a map: </strong>Agents can continue issuing searches, but without a corpus map they are still guessing which query terms or search directions will be productive.</p></li><li><p><strong>Existing hierarchical RAGs still hide the hierarchy: </strong>Systems like RAPTOR (<a href="https://aiexpjourney.substack.com/p/advanced-rag-12-enhancing-global">Advanced RAG 12: Enhancing Global Understanding</a>) or GraphRAG (<a href="https://aiexpjourney.substack.com/t/graph-rag">AI Exploration Journey: Graph RAG</a>) build trees or graphs, but queries still rely heavily on vector search.
The model cannot directly &#8220;see and browse&#8221; the structure.</p></li></ol><h2>CORPUS2SKILL: The Agentic Shift from Searching to Navigating</h2><p><strong>CORPUS2SKILL</strong> centers on a simple idea: first, organize the document repository offline into a navigable skill forest, then let an LLM explore it like folders when looking for answers, rather than relying on direct vector retrieval.</p>
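<p>To make the &#8220;explore it like folders&#8221; idea concrete, here is a minimal Python sketch. It is not the CORPUS2SKILL implementation: the <code>SkillNode</code> class, the <code>navigate</code> function, the toy corpus, and the word-overlap score are all hypothetical stand-ins. In a real system, an LLM would presumably read the node summaries and decide where to go next.</p>

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SkillNode:
    """One folder-like node in a hypothetical skill hierarchy."""
    name: str
    summary: str
    children: list[SkillNode] = field(default_factory=list)
    documents: list[str] = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score (assumption): count of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(node: SkillNode, query: str, path: tuple = ()) -> tuple:
    """Depth-first walk: try the most promising child first, backtrack on a miss.

    Returns (path_of_node_names, matching_documents).
    """
    path = path + (node.name,)
    hits = [doc for doc in node.documents if overlap(query, doc) > 0]
    if hits:
        return path, hits
    # Rank children by how well their summary matches the query.
    ranked = sorted(node.children, key=lambda c: overlap(query, c.summary), reverse=True)
    for child in ranked:
        found_path, found = navigate(child, query, path)
        if found:
            return found_path, found
    # Nothing under this node: report the miss so the caller can backtrack.
    return path, []


# A tiny made-up corpus, organized offline into a two-branch hierarchy.
forest = SkillNode("root", "all support topics", children=[
    SkillNode("billing", "payments invoices refunds",
              documents=["How refunds work for premium plans"]),
    SkillNode("accounts", "account types verification login",
              documents=["Steps for account verification of a legal entity"]),
])

path, docs = navigate(forest, "account verification policy")
print(path, docs)
```

<p>The walk tries the most promising child first and backtracks when a branch yields no evidence, which is the behavior Figure 1 contrasts with one-shot top-k retrieval.</p>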
      <p>
          <a href="https://aiexpjourney.substack.com/p/corpus2skill-turning-flat-rag-into">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[SkillClaw: One Agent Fails, Every Agent Learns — AI Innovations and Insights 132]]></title><description><![CDATA[Evolving agent skills is currently an exciting new direction.]]></description><link>https://aiexpjourney.substack.com/p/skillclaw-one-agent-fails-every-agent</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/skillclaw-one-agent-fails-every-agent</guid><pubDate>Tue, 12 May 2026 09:12:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7iHw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Evolving agent skills is currently an exciting new direction. I&#8217;ve previously written several articles introducing this topic (<a href="https://aiexpjourney.substack.com/t/skills">AI Exploration Journey: Skills</a>). </p><p>Today, let&#8217;s explore a framework for evolving agent skills specifically designed for the multi-user OpenClaw ecosystem.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Limitations of Existing Skill Methods </h2><p>Every day, LLM agents interact extensively with users, continuously accumulating experiences. Yet, these experiences rarely contribute meaningfully to enhancing the overall system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7iHw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7iHw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 424w, https://substackcdn.com/image/fetch/$s_!7iHw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 848w, https://substackcdn.com/image/fetch/$s_!7iHw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 1272w, https://substackcdn.com/image/fetch/$s_!7iHw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7iHw!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png" width="1200" height="628.8461538461538" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:763,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:5584096,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/195240654?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7iHw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 424w, https://substackcdn.com/image/fetch/$s_!7iHw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 848w, https://substackcdn.com/image/fetch/$s_!7iHw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7iHw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a532522-6a6a-4020-b03d-65aa6016a71e_2966x1554.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Existing skill methods generate plenty of agent experience, but fail to turn fragmented user interactions into shared, continuously improving skills. <strong>Image by author.</strong></figcaption></figure></div><p>To clarify further:</p><ul><li><p>Agent skills today are largely static. 
Typically, skills are manually installed and maintained. Once deployed, they seldom change. Even if an agent discovers better methods through actual usage, these improvements rarely make it beyond the current interaction and rarely enter the skill repository.</p></li><li><p>Different users repeatedly encounter similar issues. Users often face similar workflows, call similar tools, and even make similar mistakes like incorrect parameter formats, improper tool invocation sequences, or missing validation steps. However, the system doesn&#8217;t integrate these recurrent success patterns or failure modes across users and over time. As a result, many users end up rediscovering similar fixes independently.</p></li><li><p>Existing solutions fall short. Memory-based methods typically just store historical records for easier retrieval, but they struggle to abstract these into genuinely reusable or improvable skills. Conversely, skill-based methods do structure experience into skills, but these repositories are usually static and do not evolve naturally through ongoing use.</p></li></ul><p>If individual users&#8217; fragmented experiences from multiple interactions could be transformed into skills shared and continuously enhanced across the entire multi-user agent system, the impact would be significant. Validated improvements discovered in one context can be propagated through the shared skill repository, so future agents that use the relevant skills can benefit from collective experience.</p><h2>SkillClaw's Architecture: A Closed-Loop Pipeline for Collective Evolution</h2><p>As shown in Figure 2, SkillClaw gathers interaction traces from daily multi-user agent usage, evolves them into improved skills, and then synchronizes these enhanced skills back to all agents, forming an automated feedback loop. </p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/skillclaw-one-agent-fails-every-agent">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MinerU-Diffusion: A New Path Beyond Autoregressive OCR — AI Innovations and Insights 131]]></title><description><![CDATA[The uncomfortable truth is that some OCR systems look smarter than they are because language helps them fill in the blanks. But when the page stops being predictable, real visual reading becomes much harder to fake.]]></description><link>https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mineru-diffusion-a-new-path-beyond</guid><pubDate>Fri, 08 May 2026 01:38:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Uyw3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>The uncomfortable truth is that some OCR systems look smarter than they are because language helps them fill in the blanks.</strong> But when the page stops being predictable, real visual reading becomes much harder to fake.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aiexpjourney.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI Exploration Journey is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Where Autoregressive OCR Starts to Break Down</h2><p>Most existing OCR and visual language models (VLMs) rely heavily on autoregressive decoding, meaning they generate text tokens sequentially, one by one, from left to right. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uyw3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uyw3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 424w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 848w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uyw3!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png" width="1200" height="759.065934065934" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:4272528,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Uyw3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 424w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 848w, https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Uyw3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694b9b2a-c57d-48fc-9e35-f9e913d237b4_2396x1516.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: AR-based OCR decodes tokens left to right, causing latency, error propagation, and reliance on language priors when semantics are disrupted. 
MinerU-Diffusion reframes OCR as inverse rendering and uses block-wise masked diffusion to refine tokens in parallel under visual conditioning, with a tunable speed&#8211;accuracy trade-off. <strong>Image by author</strong>.</figcaption></figure></div><p>While this approach works well for standard text generation tasks, it&#8217;s far from ideal for document OCR. Here&#8217;s why:</p><ol><li><p><strong>Speed Issues:</strong> Documents, especially lengthy ones filled with tables, formulas, and complex layouts, require generating many tokens. Decoding each token sequentially leads to significant latency, slowing the entire recognition process.</p></li><li><p><strong>Error Propagation:</strong> Autoregressive methods are highly sensitive to early mistakes. A single recognition error can distort the context for subsequent tokens, causing a cascade of inaccuracies that build upon one another.</p></li><li><p><strong>Over-Reliance on Language Priors:</strong> On the Semantic Shuffle benchmark, AR models often lean heavily on linguistic cues and semantic coherence. This means they may &#8220;guess&#8221; the text rather than actually perceive it. When the semantic structure is disrupted or ambiguous, AR performance typically drops dramatically.</p></li><li><p><strong>OCR as Inverse Rendering:</strong> Fundamentally, document OCR is better thought of as &#8220;inverse rendering.&#8221; The goal is to reconstruct structured information (like text, layouts, tables, and equations) from a two-dimensional image. The correct interpretation primarily depends on visual evidence and spatial arrangements. 
Forcing a strict left-to-right serialization is merely an &#8220;implementation artifact&#8221; adopted for representational convenience, rather than a fundamental property of how documents are actually structured.</p></li><li><p><strong>A Strong Fit for Diffusion:</strong> Unlike open-ended text generation (like chatting with ChatGPT), OCR is a near-deterministic task with limited semantic ambiguity. 
This makes OCR a strong candidate for masked diffusion, where masked tokens can be predicted in parallel conditioned on the image and partially observed sequence, producing a tunable speed&#8211;accuracy trade-off.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AlNT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AlNT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 424w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 848w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1272w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AlNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png" width="1456" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148795,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AlNT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 424w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 848w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1272w, https://substackcdn.com/image/fetch/$s_!AlNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c2af8d-70a4-4a7a-8fae-7ba605aedc59_1568x646.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Overview of the document OCR inverse rendering process under different decoding methods. The model maps a 2D document image to a 1D token sequence, which is decoded either autoregressively or with diffusion. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>Given these considerations, document OCR systems would greatly benefit from decoding strategies that are parallelized, globally consistent, and strongly grounded in visual features. 
Rather than forcing OCR into the sequential patterns of autoregressive language generation, it&#8217;s more natural to employ methods designed specifically to exploit visual structure.</p><h2>MinerU-Diffusion: From Left-to-Right OCR to Parallel Visual Decoding</h2><p>MinerU-Diffusion uses diffusion-based decoding instead of the traditional autoregressive method, enabling the model to simultaneously confirm or correct multiple tokens through visual context. This approach boosts processing speed, reduces error propagation, and decreases reliance on linguistic context for guessing content.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Xcp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Xcp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 424w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 848w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png" width="1456" height="1103" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1103,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:530923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Xcp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 424w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 848w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!2Xcp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4400ed01-82ac-4d73-9e68-443bf45c5db9_1568x1188.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: (a) The confidence threshold controls decoding parallelism in MinerU-Diffusion. Compared to MinerU2.5, this method achieves up to 3.26&#215; speedup. (b) MinerU-Diffusion maintains a strong accuracy&#8211;efficiency trade-off, achieving 2.12&#215; speedup with 99.9% and 3.01&#215; speedup with 98.8% relative accuracy. 
(c) Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked, enabling parallel generation with global consistency, in contrast to autoregressive left-to-right decoding. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>The method can be understood through four practical components.</p><h4>1. Unified Output Format</h4><p>Text, layout annotations, table symbols, and formula indicators are all represented as a unified sequence of tokens. </p><p>For document parsing, the model outputs a structured sequence rather than only plain text; task-specific prompts can still produce plain text, LaTeX, or table markup.</p><h4>2. Diffusion-Based Decoding Replacing Autoregression</h4><p>During training, tokens are randomly masked, and the model learns to predict them from the surrounding context and the visual evidence in the document image.</p><p>At inference, the model progressively reconstructs masked positions, using already decoded context and visual features rather than generating strictly left to right. Over multiple rounds, uncertain tokens are progressively <strong>revealed and corrected</strong>.</p><h4>3. Block-wise Diffusion</h4><p>Diffusing across an entire document sequence can be slow and unstable, so sequences are divided into smaller blocks:</p><ul><li><p><strong>Within blocks</strong>: Diffusion is parallelized, and context is considered bidirectionally.</p></li><li><p><strong>Between blocks</strong>: A coarse, front-to-back dependency helps preserve sequence coherence and reduce long-range drift. 
</p></li><li><p><strong>System Efficiency</strong>: The causal (front-to-back) structure across blocks naturally enables <strong>efficient KV-caching</strong> during inference, reducing memory and computation costs compared to full-attention diffusion models.</p></li></ul><p>This design maintains fast parallel decoding while mitigating position drift and error accumulation common in lengthy documents.</p><h4>4. Confidence-Driven Dynamic Decoding + Two-Stage Training</h4><p>During inference, tokens with high confidence are confirmed first, while low-confidence tokens undergo further iterative correction. Confidence thresholds balance decoding speed and accuracy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vqsu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vqsu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 424w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 848w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png" width="1456" height="493" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150475,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vqsu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 424w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1272w, https://substackcdn.com/image/fetch/$s_!Vqsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3582f8dd-a31d-4dd8-b5d8-308b9104e845_1564x530.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: Training of MinerU-Diffusion. 
Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>After multimodal initialization, training proceeds in two stages: broad-scale training first builds general capabilities, and an <strong>uncertainty-driven</strong> refinement stage follows. The model automatically mines challenging examples (like complex tables or ambiguous boundaries) <strong>by measuring its own inference consistency</strong>, focusing its learning on the hardest cases to enhance robustness.</p><p>In short, MinerU-Diffusion treats document OCR as the inverse problem of reconstructing structured text from images. It uses block-wise diffusion to refine tokens in parallel, and it combines confidence-driven scheduling with hard-case training to improve decoding speed, stability, and reliability.</p><h2>Evaluation</h2><h4>Document Parsing Evaluation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!82wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!82wd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 424w, 
https://substackcdn.com/image/fetch/$s_!82wd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 848w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1272w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!82wd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png" width="1456" height="585" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:585,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!82wd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 424w, https://substackcdn.com/image/fetch/$s_!82wd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 848w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1272w, https://substackcdn.com/image/fetch/$s_!82wd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ac1200-d1b2-4efb-9803-94953d2ff505_1564x628.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: Comprehensive evaluation of document parsing on OmniDocBench v1.5. &#8593; denotes higher is better, &#8595; denotes lower is better. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>MinerU-Diffusion&#8217;s capability in full-page document parsing is evaluated using OmniDocBench v1.5, measuring its performance through various metrics such as text edit distance, formula correctness (CDM), table extraction quality (TEDS), and reading order.</p><p>The results showed that MinerU-Diffusion achieved an overall score of <strong>88.94</strong> without using ground-truth layouts. When provided with ground-truth layouts, the score improved significantly to <strong>93.37</strong>, coming very close to the performance of strong autoregressive OCR systems. 
This mainly shows that once layout errors are removed, its recognition quality is highly competitive.</p><h4>Efficiency Evaluation</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fyib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fyib!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 424w, https://substackcdn.com/image/fetch/$s_!fyib!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 848w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1272w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fyib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png" width="1456" height="345" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/196095932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fyib!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 424w, https://substackcdn.com/image/fetch/$s_!fyib!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 848w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1272w, https://substackcdn.com/image/fetch/$s_!fyib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F638f108a-0e17-477a-af7c-324c9e1daa81_1538x364.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 6: Threshold sensitivity analysis of TPF, TPS, and accuracy. 
TPF denotes tokens per forward, and TPS refers to throughput measured on an NVIDIA H200 GPU with a batch size of 1. [<a href="https://arxiv.org/pdf/2603.22458v1">Source</a>].</figcaption></figure></div><p>Efficiency was tested by adjusting the confidence thresholds, which determine how many tokens the model finalizes in a single decoding step. Lower thresholds led to faster decoding speeds, while higher thresholds improved stability. </p><p>MinerU-Diffusion achieved up to a <strong>3.26&#215; decoding speedup</strong> over MinerU2.5 and kept a clear speed advantage at high accuracy: 2.12&#215; at 99.9% relative accuracy and 3.01&#215; at 98.8% (Figure 3).</p><h2>Thoughts</h2><p>At its core, MinerU-Diffusion transforms OCR decoding from sequential token-by-token generation into a visually driven, block-wise diffusion process: tokens are refined in parallel within each block, while blocks retain a coarse front-to-back dependency.</p><p>Coupled with uncertainty-driven curriculum training, this shift is a change in the decoding paradigm itself, not merely a swap of the underlying model backbone.</p><p><strong>But I have a concern.</strong> </p><p>Block boundaries could introduce new sources of subtle errors. <strong>MinerU-Diffusion mitigates this by letting tokens attend causally to preceding blocks, but tokens remain strictly cut off from future blocks.</strong> Structures like headers, footers, table cells, or formulas spanning line breaks might still be disrupted if they fall near these boundaries. 
Such systemic fragmentation might not clearly surface through averaged evaluation metrics.</p><div><hr></div><p>Reference: </p><ul><li><p>Paper: <a href="https://arxiv.org/pdf/2603.22458v1">MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</a>.</p></li><li><p>Code: <a href="https://github.com/opendatalab/MinerU-Diffusion">https://github.com/opendatalab/MinerU-Diffusion</a>.</p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[MultiDocFusion: From Flat Chunks to Hierarchy-Aware RAG — AI Innovations and Insights 130]]></title><description><![CDATA[For RAG, a lot of us begin with the same almost invisible premise: chunk the document, embed the chunks, retrieve the top matches, and the rest will take care of itself.]]></description><link>https://aiexpjourney.substack.com/p/multidocfusion-from-flat-chunks-to</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/multidocfusion-from-flat-chunks-to</guid><pubDate>Mon, 04 May 2026 02:04:51 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!cKD1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For RAG, a lot of us begin with the same almost invisible premise: chunk the document, embed the chunks, retrieve the top matches, and the rest will take care of itself. </p><p>That story holds up right until a long industrial PDF arrives on your desk and makes one thing obvious: documents were never meant to be understood as confetti.</p><h2>The Hidden Pitfalls of Naive RAG Chunking</h2><p>Many RAG systems retrieve relevant chunks from documents and then ask an LLM to answer from that retrieved context.</p><p>However, there&#8217;s a catch: to make documents searchable, especially longer ones, they usually have to be split into smaller pieces. Traditional methods for splitting these documents can be overly simplistic, such as slicing by a fixed number of words or strictly relying on semantic meaning.</p><p>In practice, this approach often leads to problems. 
Real-world documents aren&#8217;t simply chunks of plain text; they often include:</p><ul><li><p>Multiple levels of headings and subheadings</p></li><li><p>Tables, figures, and other layout-heavy elements</p></li><li><p>Content spanning multiple pages</p></li><li><p>Scanned PDFs that require OCR processing</p></li></ul><p>When documents are just &#8220;split by length,&#8221; it&#8217;s akin to randomly chopping up a textbook: headings end up isolated from their sections, paragraphs get separated from accompanying tables, and critical context can get lost altogether. </p><p>This fragmentation can make retrieval return disjointed evidence, which can reduce answer quality. The problem becomes especially pronounced when dealing with industrial documents, financial reports, legal contracts, and scanned materials.</p><h2>Core Idea</h2><p>To make RAG over long documents more faithful, three kinds of signals need to be modeled together:</p><ul><li><p>Visual layout (what the document looks like)</p></li><li><p>Textual content (what the document says)</p></li><li><p>Structural hierarchy (how the document is organized)</p></li></ul><p>As shown in Figure 1, MultiDocFusion combines visual layout, text, and hierarchy; the name &#8220;Fusion&#8221; refers to that integration, while &#8220;MultiDoc&#8221; reflects support for diverse document formats and corpus-level multi-document RAG scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ux6L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ux6L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 424w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 848w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1272w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ux6L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png" width="728" height="421.8426966292135" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:722,&quot;width&quot;:1246,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:178052,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194862047?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ux6L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 424w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 848w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1272w, https://substackcdn.com/image/fetch/$s_!ux6L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F176006bc-0868-42ac-9856-9ca3c3865f24_1246x722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: The pipeline for MultiDocFusion. The figure illustrates the step-by-step process for handling a long industrial document. (a) DP extracts layout structures; (b) OCR recognizes and annotates text; (c) DSHP-LLM constructs a hierarchical tree from identified section headers and general nodes; (d) DFS-based Grouping constructs coherent hierarchical chunks for retrieval tasks. The color-coded blocks represent document elements: yellow for Root and Title, red for Section Headers, and green for general nodes (tables, figures, and text blocks). 
[<a href="https://arxiv.org/pdf/2604.12352v1">Source</a>].</figcaption></figure></div><h4>An Intuitive Analogy</h4><p>Traditional fixed-length methods are more like slicing a book at fixed token boundaries.</p><p>MultiDocFusion, on the other hand, acts like an intelligent reader: it dynamically reconstructs the document's structural tree (like building its own table of contents on the fly) and carefully groups content that naturally belongs together.</p><p>This makes it particularly effective for handling long, complex documents that reflect real-world complexities.</p><h2>MultiDocFusion: How Does It Actually Work?</h2><p>Think of it as a four-step process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cKD1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cKD1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 424w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 848w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 1272w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cKD1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png" width="1200" height="787.0879120879121" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:955,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1176450,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194862047?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cKD1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 424w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 848w, https://substackcdn.com/image/fetch/$s_!cKD1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cKD1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0dee6203-1350-4c0b-ba04-fc6b17735fde_1470x964.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: MultiDocFusion in 4 steps. Image by author.</figcaption></figure></div><h4>Step 1: Understanding the Document Layout (Document Parsing)</h4>
      <p>
          <a href="https://aiexpjourney.substack.com/p/multidocfusion-from-flat-chunks-to">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Trace2Skill: Distilling Agent Experience into Transferable Skills — AI Innovations and Insights 129]]></title><description><![CDATA[In a previous article (Agent Skills: Distill Tasks into Discoverable, Reusable Capabilities), I explained what agent skills are.]]></description><link>https://aiexpjourney.substack.com/p/trace2skill-distilling-agent-experience</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/trace2skill-distilling-agent-experience</guid><pubDate>Thu, 30 Apr 2026 02:50:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6KRW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd271851-fea3-4580-9c68-91c1528ec09e_1356x498.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a previous article (<a href="https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable">Agent Skills: Distill Tasks into Discoverable, Reusable Capabilities</a>), I explained what agent skills are. </p><p>Today, the focus shifts to a new method for acquiring those skills.</p>
<h2>Why Agents Need a Master Manual, Not a Pile of Sticky Notes</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4wiG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4wiG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4wiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png" width="728" height="485.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1291266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194611304?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4wiG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4wiG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4wiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec099b53-7f98-4dbf-9bec-d08689f04a6f_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: An example of a Skill. 
Image by author.</figcaption></figure></div><p>A skill is more than a single file: it is a structured knowledge directory containing a root SKILL.md (the instruction manual) alongside auxiliary resources such as executable scripts and reference files.</p><p>As agents tackle more specialized domains, the demand for comprehensive, domain-specific skill documentation grows, but manual authoring cannot keep up.</p><p>Automated approaches face two major challenges of their own:</p><p>First, a large language model that generates skills purely from its parametric knowledge often misses the real pain points, operational nuances, and common failure modes of a target domain. The resulting skills might look plausible on paper, but they rarely provide meaningful help in actual tasks.</p><p>Second, many existing online update methods based on trajectory data tend to fragment knowledge and overfit to local experience.</p><ul><li><p>Fragmentation: Many approaches extract a lesson from each trajectory, ending up with a heap of disconnected skills that are hard to search and apply.</p></li><li><p>Sequential updates: Updating skills after every new trajectory is like trying to revise a guide without ever fully understanding the domain. It&#8217;s easy for a single trajectory to skew the knowledge.</p></li></ul>
      <p>
          <a href="https://aiexpjourney.substack.com/p/trace2skill-distilling-agent-experience">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MinerU2.5-Pro: Teaching a 1.2B Model to Parse Like a Giant — AI Innovations and Insights 128]]></title><description><![CDATA[In my earlier articles, I already covered the core idea behind MinerU (From Big Picture to Details: MinerU 2.5 Redefines Document Parsing &#8212; AI Innovations and Insights 77).]]></description><link>https://aiexpjourney.substack.com/p/mineru25-pro-teaching-a-12b-model</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mineru25-pro-teaching-a-12b-model</guid><pubDate>Sun, 26 Apr 2026 01:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!E_t9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16f2cd1e-0714-4246-b92b-1561c75c5f7e_1424x646.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my earlier articles, I already covered the core idea behind MinerU (<a href="https://aiexpjourney.substack.com/p/from-big-picture-to-details-mineru">From Big Picture to Details: MinerU 2.5 Redefines Document Parsing &#8212; AI Innovations and Insights 77</a>). </p><p>In this post, let&#8217;s take a look at its latest progress.</p>
<h2>The Key Constraint in Document Parsing</h2><p>A recurring pattern has emerged: state-of-the-art models with different architectures and parameter sizes often make almost identical mistakes on challenging real-world PDF samples. </p><p>Simply switching to a larger model is unlikely to solve the problem. Instead, the bottleneck lies in a shared set of data-related challenges that limit progress across the board.</p><p>As a result, the key constraint on further advances in document parsing may no longer be model architecture, but the training data itself.</p><p>This data bottleneck has two main dimensions:</p><ul><li><p>First, insufficient coverage. Training datasets contain plenty of common page layouts, but long-tail scenarios such as complex nested tables, dense mathematical formulas, and <strong>unconventional multi-column layouts</strong> remain underrepresented.</p></li><li><p>Second, there is an annotation quality paradox: many of the most informative samples, especially the difficult ones, are exactly the cases where automatic annotation is least reliable. Because of this, simply scaling raw data volume is insufficient: without better sampling and more reliable labels, it can amplify both distribution bias and annotation noise.</p></li></ul><p>Given these challenges, an important question arises: if the architecture remains unchanged, can systematic data engineering and staged training still deliver significant gains, and can a more discriminative evaluation protocol reveal those gains more faithfully? 
</p><p><strong>More provocatively: can a 1.2B-parameter model rely purely on data to outperform giants with over 200x more parameters?</strong></p><h2>MinerU2.5-Pro: Data Engine</h2>
      <p>
          <a href="https://aiexpjourney.substack.com/p/mineru25-pro-teaching-a-12b-model">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Agent Skills: Distill Tasks into Discoverable, Reusable Capabilities — AI Innovations and Insights 127]]></title><description><![CDATA[Building Agents is like training a new colleague who&#8217;s brilliant but forgetful.]]></description><link>https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable</guid><pubDate>Thu, 23 Apr 2026 01:04:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fBjM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0305ca1-e38d-4189-8940-c87f5e5b77ea_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building Agents is like training a new colleague who&#8217;s brilliant but forgetful. </p><p>They know what to do, just not always in the right order or at the right time.</p><h2>Why Are Skills Needed?</h2><p>Over the past two years, discussions around Agents have often focused on bigger models, longer context windows, and more tools. 
</p><p>But the real challenge in <strong>production-level</strong> Agents isn&#8217;t whether the model can do something; it&#8217;s whether it can reliably complete a task the same way every time.</p><p>As an Agent&#8217;s tasks grow longer, environments become more complex, and branches multiply, language models tend to drift. Steps get skipped, sequences get jumbled, tools are used inconsistently, and stopping conditions vary. </p><p>Together, these failure modes are known as the procedural burden of agency.</p><p>Before introducing Skills, let&#8217;s look at where they fit within an Agent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!deG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!deG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 424w, https://substackcdn.com/image/fetch/$s_!deG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 848w, https://substackcdn.com/image/fetch/$s_!deG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!deG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!deG_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png" width="1200" height="864.5604395604396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1049,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1006422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194257779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!deG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 424w, https://substackcdn.com/image/fetch/$s_!deG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 848w, https://substackcdn.com/image/fetch/$s_!deG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1272w, 
https://substackcdn.com/image/fetch/$s_!deG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F893d06b4-9a66-416c-9cad-0d509211372a_1468x1058.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: Externalization as the organizing principle of LLM agent design. Upper panel: The arc of human cognitive externalization from thought through language, writing, printing, to digital computation. 
Middle panel: The corresponding externalization arc for LLM agents, from weights through three externalization dimensions&#8212;Memory (externalized state), Skills (externalized expertise), and Protocols (externalized interaction)&#8212;to the Harness that unifies them. Lower panel: A literature landscape mapping representative works onto three capability layers&#8212;Weights, Context, and Harness&#8212;illustrating how research threads have progressively migrated outward. The parallel between the two arcs encodes a recursive claim: LLM agents achieve reliable agency by externalizing cognitive burdens along the same representational dimensions that have driven human cognitive history. [<a href="https://arxiv.org/pdf/2604.08224v1">Source</a>].</figcaption></figure></div><h2>Skills</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q6l2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q6l2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 424w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 848w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q6l2!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png" width="1200" height="1183.5164835164835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:442684,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/194257779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q6l2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 424w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 848w, 
https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1272w, https://substackcdn.com/image/fetch/$s_!Q6l2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2044a4e0-f153-4cfd-82f5-424fedf112dd_1756x1732.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: An example of Skills. 
Image by author.</figcaption></figure></div><p>Skills were created as an externalization response to this procedural burden.</p><p>A skill is not just a tool, and not merely a prompt file; it is an externalized form of procedural expertise. Tools answer the question &#8220;what actions can be performed,&#8221; protocols answer &#8220;how these actions are described and invoked,&#8221; and a Skill answers &#8220;how this type of task should be handled as a whole.&#8221;</p><p>A truly reusable Skill has at least three layers: the operational flow, decision heuristics, and governing constraints. In other words, it not only tells an Agent what to do first and what to do next, but also guides which path to take when there are branches, and defines which boundaries must never be crossed.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/agent-skills-distill-tasks-into-discoverable">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MARCH: A Multi-Agent Framework for Factual RAG and Hallucination Reduction — AI Innovations and Insights 126]]></title><description><![CDATA[Have you ever run into this problem when working on a RAG project: even with retrieved evidence at hand, the model still produces answers that seem correct but are actually wrong?]]></description><link>https://aiexpjourney.substack.com/p/march-a-multi-agent-framework-for</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/march-a-multi-agent-framework-for</guid><pubDate>Sat, 18 Apr 2026 04:53:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!04-G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60594711-cd53-4513-ae6f-837110f5b801_1342x642.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever run into this problem when working on a RAG project: even with retrieved evidence at hand, the model still produces answers that seem correct but are actually wrong? And the usual self-checking or verification methods often get misled by the model&#8217;s own output.</p><p>This post offers some interesting ideas.</p>
<h2>Motivation</h2><p>Hallucination remains a critical bottleneck for LLMs, undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. </p><p>There are three core challenges:</p><ul><li><p>First, even with access to documents, models can misreport numbers, mix up timelines, or make reasoning jumps that ignore the evidence. In fields like finance, law, and healthcare, these mistakes can have serious consequences.</p></li><li><p>Second, existing training signals are too coarse. Traditional supervised fine-tuning may teach a model to &#8220;sound convincing,&#8221; but not to ensure every fact is accurate. Standard reinforcement learning methods, including RLHF, usually provide a single score for the final answer, which cannot pinpoint which individual claim is wrong. Even newer approaches like RL with Verifiable Rewards (RLVR) are bottlenecked by the scarcity of expert annotations and the limited reasoning ceilings of external verifiers.</p></li><li><p>Third, current verifiers suffer from confirmation bias <strong>due to &#8220;information leakage&#8221;</strong>. When a judge sees the question, the source documents, and the model&#8217;s answer at the same time, there&#8217;s a strong tendency to justify the original response rather than independently verify it. 
This is a fundamental flaw in many hallucination detection methods.</p></li></ul><h2>MARCH: Three Agents, One Goal for Tighter Grounding and Fewer Hallucinations</h2><p>MARCH is a framework that separates answer generation from answer verification and intentionally creates <strong>information asymmetry</strong>.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/march-a-multi-agent-framework-for">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MOCR: From Text-Only OCR to Parse-Anything Intelligence — AI Innovations and Insights 125]]></title><description><![CDATA[If you work with documents long enough, you start noticing how often the most important information isn&#8217;t in the paragraphs.]]></description><link>https://aiexpjourney.substack.com/p/mocr-from-text-only-ocr-to-parse</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/mocr-from-text-only-ocr-to-parse</guid><pubDate>Tue, 14 Apr 2026 04:48:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I573!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you work with documents long enough, you start noticing how often the most important information isn&#8217;t in the paragraphs. It hides in charts, tables, diagrams, and all the parts traditional OCR still treats like decorative noise.</p>
<h2>Why Traditional OCR Is No Longer Enough</h2><p>Traditional OCR and many document parsing pipelines are still largely text-centric: they recover text well, but usually do not structurally parse information-dense graphics.</p><p>When confronted with charts, diagrams, icons, or UI components, they usually treat these elements as mere image fragments and store them as pixel blocks. As a result, a significant portion of structural and semantic information within documents is lost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I573!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I573!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 424w, https://substackcdn.com/image/fetch/$s_!I573!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 848w, 
https://substackcdn.com/image/fetch/$s_!I573!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1272w, https://substackcdn.com/image/fetch/$s_!I573!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I573!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png" width="1200" height="547.2527472527472" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:664,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:482524,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/193134730?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I573!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 424w, 
https://substackcdn.com/image/fetch/$s_!I573!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 848w, https://substackcdn.com/image/fetch/$s_!I573!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1272w, https://substackcdn.com/image/fetch/$s_!I573!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320da26a-cfd7-4fe7-8b8d-809605a91abc_1636x746.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Comparison between traditional text-only OCR and MOCR paradigms. Traditional OCR treats graphics as pixels and often discards them, while MOCR parses graphics into structured code (e.g. SVG), enabling faithful reconstruction and broader downstream applications. [<a href="https://arxiv.org/pdf/2603.13032v2">Source</a>].</figcaption></figure></div><p>Figure 1 captures the main shift: traditional OCR keeps graphics as raster regions, while Multimodal OCR represents text and eligible visual symbols in a unified serialized format, using plain text, table markup, LaTeX, or SVG depending on the element type. </p><h2>MOCR: Not Just Reading Text, but Reconstructing Graphics</h2><p>MOCR, short for Multimodal OCR, aims to parse anything found in documents. </p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/mocr-from-text-only-ocr-to-parse">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[PaperBanana: Surpassing Nano-Banana, Let the Paper Draw Itself — AI Innovations and Insights 124]]></title><description><![CDATA[Current AI can read papers, write code, and even suggest ideas, but many researchers still end up wrestling with figures by hand.]]></description><link>https://aiexpjourney.substack.com/p/paperbanana-surpassing-nano-banana</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/paperbanana-surpassing-nano-banana</guid><pubDate>Fri, 10 Apr 2026 02:27:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!x8Xs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ee9d90-6241-4ca2-a72c-01134381c16a_1916x804.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Current AI can read papers, write code, and even suggest ideas, but many researchers still end up wrestling with figures by hand. That mismatch says a lot about what is still missing in the research workflow.</p><p>This post looks at one way to close that gap.</p>
<h2>Motivation</h2><p>There are two common approaches to creating <strong>methodology diagrams and statistical plots</strong> for research papers today, and neither of them quite gets the job done.</p><ul><li><p>The first is code-based drawing, using tools like TikZ, python-pptx, or SVG. This approach works well for structured diagrams and precise layouts. But it starts to fall short when dealing with the kinds of visuals that show up in modern papers, things like intricate icons, custom shapes, and polished, detail-heavy designs. Flexibility becomes a real limitation.</p></li><li><p>The second approach is to rely on image generation models. These can produce visually appealing results with very little effort. The problem is consistency. It is hard to guarantee that the output meets the standards of academic work, both in terms of factual accuracy and visual conventions. 
A figure might look good at first glance, but still require careful checking to ensure <strong>logical faithfulness and academic readability</strong>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yW7A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yW7A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 424w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 848w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1272w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yW7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png" width="726.7890625" height="423.16336862664474" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:708,&quot;width&quot;:1216,&quot;resizeWidth&quot;:726.7890625,&quot;bytes&quot;:955850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/192685509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yW7A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 424w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 848w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1272w, https://substackcdn.com/image/fetch/$s_!yW7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeef55f5-71c0-482a-8afa-7b4407a4346e_1216x708.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Examples of methodology diagrams and statistical plots generated by PaperBanana, which show the potential of automating the generation of academic illustrations. [<a href="https://arxiv.org/pdf/2601.23265v2">Source</a>].</figcaption></figure></div><p>In practice, this means researchers often end up spending a significant amount of time manually fixing figures anyway, even after using these tools.</p><h2>PaperBanana: Read, Plan, Stylize, and Sketch Science</h2><p>PaperBanana is an agentic, reference-driven framework that tries to bridge this gap by synthesizing aesthetic guidelines from curated academic examples.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/paperbanana-surpassing-nano-banana">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[KohakuRAG: Climb the Document Tree, Find the Right Evidence — AI Innovations and Insights 123]]></title><description><![CDATA[If you&#8217;ve ever built a RAG pipeline at work, you probably know the feeling: retrieval looks fine in demos, then quietly falls apart on real questions.]]></description><link>https://aiexpjourney.substack.com/p/kohakurag-climb-the-document-tree</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/kohakurag-climb-the-document-tree</guid><pubDate>Mon, 06 Apr 2026 09:21:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aqjS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever built a RAG pipeline at work, you probably know the feeling: retrieval looks fine in demos, then quietly falls apart on real questions.</p><p>This post introduces a method that tackles that gap in a very practical, engineering-first way.</p>
<h2>The Problems of Standard RAG</h2><p>Standard RAG is often not reliable enough for high-precision tasks such as the WattBot 2025 Challenge, which demands &#177;0.1% numeric precision and exact source attribution. In practice, three issues show up again and again:</p><ul><li><p>Flat chunking breaks semantic boundaries and document structure, which makes precise source tracking much harder and weakens confidence in citation quality.</p></li><li><p>A single query can easily miss relevant evidence when the wording does not line up, especially when different terms are used for the same idea.</p></li><li><p>A single generation pass is inherently unstable. Both the answer and its citations can vary from run to run, and in some cases the system, overly cautious about hallucinating, abstains unnecessarily (outputting a blank) even though the evidence is right there in the context.</p></li></ul><h2>KohakuRAG: From Flat Chunks to Citation-Ready Trees</h2><p>At its core, KohakuRAG is a lightweight RAG framework built on a four-level hierarchical document index: document &#8594; section &#8594; paragraph &#8594; sentence. 
</p><p>It preserves document structure through bottom-up embedding aggregation, improves retrieval coverage with LLM-based query planning plus cross-query reranking, and stabilizes final outputs with abstention-aware ensemble voting during generation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aqjS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aqjS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 424w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 848w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1272w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aqjS!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png" width="1200" height="538.8888888888889" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:582,&quot;width&quot;:1296,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:188418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/192494795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aqjS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 424w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 848w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1272w, https://substackcdn.com/image/fetch/$s_!aqjS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36cf4765-adc1-48dc-a4eb-37d0e3241c9f_1296x582.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: Overview of KohakuRAG. Left (Hierarchical Indexing): Documents are parsed into tree structures with sections, paragraphs (Para), and sentences (S). Sentence embeddings are computed and aggregated bottom-up to parent levels, then stored in a Vector DB. Center (Multi-Query Retrieval): Given a question, the Query Planner (LLM) generates multiple related queries, each retrieving Top-K results that are merged via Cross-Query Reranking. Right (Ensemble Inference): Context and question are sent to the LLM for m independent runs; blank responses are filtered (X), and majority voting produces the final answer. [<a href="https://arxiv.org/pdf/2603.07612v1">Source</a>].</figcaption></figure></div><h4>1. Offline hierarchical document indexing</h4><p>The first step is document parsing.</p>
      <p>
          <a href="https://aiexpjourney.substack.com/p/kohakurag-climb-the-document-tree">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Three Paradigms Shaping Modern OCR — AI Innovations and Insights 122]]></title><description><![CDATA[Anyone who has worked with PDFs, forms, or scanned reports knows this feeling: extracting text is easy until structure starts to matter.]]></description><link>https://aiexpjourney.substack.com/p/the-three-paradigms-shaping-modern</link><guid isPermaLink="false">https://aiexpjourney.substack.com/p/the-three-paradigms-shaping-modern</guid><pubDate>Fri, 03 Apr 2026 11:04:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QPKT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anyone who has worked with PDFs, forms, or scanned reports knows this feeling: extracting text is easy until structure starts to matter. </p><p>The real shift is this: OCR is no longer just a reading task, but a document understanding problem.</p>
<h2>Modern OCR Is About Understanding Documents</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QPKT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QPKT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QPKT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1647586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/192090936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QPKT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QPKT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9b1734-9924-47ef-9d71-3048e04e94ea_1536x1024.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: OCR Beyond Text: The Rise of Document Intelligence. Image by author.</figcaption></figure></div><p>In the past, OCR usually meant one thing: recognizing text from an image.</p><p>Today, the scope of OCR has expanded far beyond basic character recognition. Real-world document processing rarely stops at reading text. It often involves layout analysis, table parsing, chart understanding, question answering, and extracting key information from complex pages. Documents are no longer just lines of text. 
They contain structure, hierarchy, and meaning that must be understood as a whole.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8_g5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8_g5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 424w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 848w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1272w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8_g5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png" width="1456" height="579" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:348405,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aiexpjourney.substack.com/i/179540724?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8_g5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 424w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 848w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1272w, https://substackcdn.com/image/fetch/$s_!8_g5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e76da27-08f8-4761-b28a-1d8fd75e8537_1760x700.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 2: Rapid growth of document parsing methods since June 2025. [<a href="https://arxiv.org/pdf/2511.10390v2">Source</a>].</figcaption></figure></div><p>Because of this shift, OCR has gradually evolved from a character recognition tool into something much closer to a document intelligence system. Instead of only asking what characters are on the page, modern OCR systems are expected to understand how the page is organized and what the content actually represents.</p><h2>Three Paradigms of Modern OCR</h2><p>Current OCR paradigms can roughly be grouped into three categories: </p><ul><li><p>Pipeline OCR systems</p></li><li><p>End-to-end OCR models</p></li><li><p>General vision-language models. 
</p></li></ul><p>These three approaches reflect different stages in the evolution of OCR, and each comes with its own trade-offs.</p><h2>Traditional Pipeline OCR: Strong on Control, Weakened by Fragmentation</h2><p>The classic design is the pipeline OCR system. A typical workflow starts with layout detection, continues with element-level recognition, and ends with a rule-based step that assembles everything into the final output.</p>
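<p>The three-stage workflow described above can be sketched roughly as follows. This is a toy illustration, not any particular OCR system: the names Region, detect_layout, recognize, assemble, and run_pipeline are hypothetical, and the detection and recognition stages are stubs that read pre-baked values from a small page dictionary instead of running real models.</p>

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str          # hypothetical labels, e.g. "title" or "text"
    bbox: tuple        # (x0, y0, x1, y1) in page coordinates
    content: str = ""  # filled in by the recognition stage


def detect_layout(page):
    """Stage 1 (stub): return labeled regions.

    A real pipeline would run a layout detection model here.
    """
    return [Region(r["kind"], r["bbox"]) for r in page["regions"]]


def recognize(region, page):
    """Stage 2 (stub): attach text to one region.

    A real pipeline would crop the region and run a recognizer suited
    to its kind (text, table, formula, and so on).
    """
    region.content = page["text_by_bbox"][region.bbox]
    return region


def assemble(regions):
    """Stage 3: rule-based assembly into a single output string.

    Rule 1: reading order is top-to-bottom, then left-to-right.
    Rule 2: titles are rendered as headings, everything else as-is.
    """
    ordered = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    parts = []
    for region in ordered:
        if region.kind == "title":
            parts.append("# " + region.content)
        else:
            parts.append(region.content)
    return "\n\n".join(parts)


def run_pipeline(page):
    regions = detect_layout(page)
    recognized = [recognize(region, page) for region in regions]
    return assemble(recognized)


# Toy input: regions are deliberately listed out of reading order.
toy_page = {
    "regions": [
        {"kind": "text", "bbox": (0, 50, 100, 90)},
        {"kind": "title", "bbox": (0, 0, 100, 40)},
    ],
    "text_by_bbox": {
        (0, 0, 100, 40): "Invoice 2024",
        (0, 50, 100, 90): "Total due: 42 USD",
    },
}

print(run_pipeline(toy_page))
```

<p>Even in this toy form, the trade-off is visible. Each stage can be inspected or swapped on its own, but the stages only communicate through the list of regions, so a mistake in layout detection would likely flow straight into the assembled output.</p>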
      <p>
          <a href="https://aiexpjourney.substack.com/p/the-three-paradigms-shaping-modern">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>