PDF RAG Indexing: Unit Detection and Chunk Noise Filtering
How to reliably detect structured unit boundaries in a bilingual PDF and prevent boilerplate text from polluting RAG vector chunks.
2 posts
How to reliably detect structured unit boundaries in a bilingual PDF and prevent boilerplate text from polluting RAG vector chunks.
Hard-won lessons from building a robust PDF chunker for a Korean-German textbook: multiple detection guards, line-level copyright stripping, and RAG behavior verification.