#regex

2 posts

PDF RAG Indexing: Unit Detection and Chunk Noise Filtering

How to reliably detect structured unit boundaries in a bilingual PDF and prevent boilerplate text from polluting RAG vector chunks.

PDF Indexing Pipeline: Unit Detection Guards and Copyright Filtering

Hard-won lessons from building a robust PDF chunker for a Korean-German textbook: multiple detection guards, line-level copyright stripping, and RAG behavior verification.