#pdftotext

1 post

PDF RAG Indexing: Unit Detection and Chunk Noise Filtering

How to reliably detect structured unit boundaries in a bilingual PDF and prevent boilerplate text from polluting RAG vector chunks.