2026-03-19 Session Log
LinguaRAG: full-stack refactoring (API proxy, backend split, repository pattern), guest auth optimization, OCR A/B tests (RapidOCR + BGE-M3 rejected), Google Document AI integration, frontend review
lingua-rag
Eliminated redundant Supabase calls for guests, consolidated 20 API routes into a shared proxy pattern, split a monolithic backend module, and unified the repository instantiation pattern
API Proxy Migration
The frontend had 20 API route files that all repeated the same pattern: get Supabase session, build headers, fetch from backend, handle errors. I extracted this into two shared utilities in api-proxy.ts:
getSession()— reads the Supabase cookie server-side and returns the access token (or null for guests/super-mode)proxyToBackend()— forwards the request to the FastAPI backend with auth headers, content-type passthrough, and error handling
This removed ~600 lines of duplicated boilerplate. Each API route is now 5-10 lines instead of 30-40.
Guest Auth Skip
Supabase’s official Next.js guide calls getUser() on every request in middleware. But for guests (no auth cookie), this makes two unnecessary network calls per page load. I added a cookie existence check — if no sb-*-auth-token cookie exists, skip the Supabase call entirely.
The tricky part: Supabase splits large auth cookies into chunks (.0, .1, etc.), so the check uses startsWith("sb-") && includes("-auth-token") to catch all variants. The onAuthStateChange listener still stays registered for detecting Google One Tap logins.
Backend Module Split
pdfs.py was handling three unrelated concerns: file management, text annotations, and vocabulary entries. Split it into three routers:
| Module | LOC | Responsibility |
|---|---|---|
pdfs.py | 370 | PDF upload, download, delete, metadata, indexing |
pdf_annotations.py | 95 | Text highlights and notes on PDF pages |
pdf_vocabulary.py | 150 | Vocabulary entries saved from PDF selections |
All three keep the same /api/pdfs/ URL prefix, so zero frontend changes were needed.
Repository Pattern Unification
Some backend modules created repository instances at module level (singletons), while others created them per-request. Unified everything to per-request instantiation — repositories are stateless (they use the connection pool for each query), so there’s no performance cost, but you gain consistency, testability, and no shared-state risk across concurrent requests.
Other Refactoring
- SSE helpers — extracted
sse()andsse_done()from duplicated definitions inchat.pyandguest.pyintocore/sse.py - SessionLockManager — extracted duplicated
OrderedDict+ mutex logic intocore/locking.pywith configurable LRU eviction - Frontend storage utils —
getStorageItem,setStorageItemwith try-catch wrappers, replacing rawlocalStoragecalls across 6 files - Frontend format utils — consolidated
formatFileSize,formatRelativeTime,formatDateLabelfrom three different components
OCR & Embedding A/B Tests
Tested two open-source alternatives to replace paid services — both rejected for failing on the core use case: cross-language search
RapidOCR vs Haiku Vision (OCR)
Hypothesis: RapidOCR with Korean PaddleOCR v5 model could replace Haiku Vision ($0.13/77 pages) for free.
Result: Text extraction ratio reached 61% (up from 22% with default model), but Chinese characters like 句 were consistently dropped in mixed CJK documents. Since the app serves Korean users studying foreign language textbooks (Chinese, Japanese, German), dropping non-Korean CJK characters is a deal-breaker. Rejected.
BGE-M3 vs OpenAI text-embedding-3-small (Embeddings)
Hypothesis: BGE-M3 (MIT license, free, 1024-dim) could match OpenAI embeddings for multilingual semantic search.
Result: Same-language overlap was strong (English: 68%), but cross-language overlap collapsed — German: 32%, Chinese: 24%, overall: 41% (threshold was 60%). The app’s primary search pattern is exactly cross-language: a Korean user asking questions about German/Chinese text. Rejected.
Decision
Both alternatives were kept in the codebase behind environment variable switches (OCR_MODE, EMBEDDING_MODE) for future re-evaluation when better models emerge. The paid stack (Haiku Vision + OpenAI embeddings) stays as default.
Google Document AI Integration
Implemented as a third OCR option — faster and cheaper than Haiku Vision, but couldn’t A/B test due to GCP billing issues
Google Document AI processes an entire PDF in a single API call (no page-by-page rendering needed), extracts text via layout text anchors, and supports 200+ languages. Cost is ~$0.0015/page ($0.12 for 77 pages vs $0.13 for Haiku Vision) and processes 77 pages in ~3 seconds vs ~60 seconds.
The integration is complete (OCR_MODE=google_docai), but the actual A/B test is blocked: the GCP billing account was closed, and creating a new one keeps failing on the payment step. Deferred until billing is resolved.
Indexing Bug Fixes
Six bugs found during code review of the indexing pipeline
| Bug | Fix |
|---|---|
| Dead code | Removed unused _extract_pages function and all_text_pages variable |
| Status flash | Text PDFs briefly showed ocr_indexing before ready — added condition to check if OCR pages actually exist beyond the first batch |
| OCR parsing | Regex replaced brittle startswith("=== Page ") with flexible pattern matching (===Page 1===, --- Page 1 ---, case-insensitive) |
| Text truncation | max_tokens 4096 was too small for dense OCR pages — increased to 8192 |
| Duplicate chunks | Phase 2 chunk_index started at 0 instead of using Phase 1’s count as offset |
| Empty pages | _chunk_page now returns [] for whitespace-only text instead of creating empty chunks |
Frontend Review (Rex Analysis)
Comprehensive frontend audit covering architecture, code quality, and performance
Strengths
- Auth pattern is excellent — Supabase SSR middleware with cookie-first guest bypass, proper
getUser()(notgetSession()) for JWT validation - TypeScript strict mode — only 3
anyusages, all with eslint-disable comments and clear justification - Clean useEffect patterns — every event listener has cleanup, dependency arrays are accurate
- Zero console.log in production code
Key Issues Found
| Issue | Severity | Detail |
|---|---|---|
| PdfViewer monolith | Critical | 2,832 lines — text selection, annotations, highlights, pinyin, keyboard shortcuts all in one file |
| ChatContext God Object | Critical | 7+ hooks composed into one Context — tree clicks cause cascading re-renders in unrelated chat components |
| Heavy components not code-split | High | PdfViewer, NoteSlidePanel (1,315 LOC), Tiptap editor always in the bundle even when not visible |
| No route-level error boundaries | Medium | Only root error.tsx exists — a chat error takes down the entire app |
| No form validation framework | Medium | Manual validation only, no Zod schemas at API boundaries |
| Minimal SEO | Low | Static metadata only, no OG images or per-route generateMetadata |
Recommended Priority
- Dynamic import PdfViewer (low effort, immediate bundle improvement)
- Split ChatContext into Auth/Pdf/UI contexts (medium effort, performance gain)
- Add route-level
error.tsxfor chat pages (low effort, UX resilience) - Lazy-load NoteSlidePanel and PronunciationModal (low effort, bundle optimization)
- PdfViewer hooks decomposition (high effort, maintainability)
What’s Next
- Google Document AI A/B test — resolve GCP billing, compare vs Haiku Vision
- PdfViewer decomposition (2,832 LOC → Core + hooks)
- NoteSlidePanel tab split, sidebar actions extraction
- Prompt v3.0 end-to-end testing
- Vercel unpause + production deployment