Retrieval Foundation for a RAG-Based AI System

I started building a RAG engine because I was tired of asking large language models questions limited by whatever context I could paste or attach into a chat window. I was looking for information on licensing requirements, regulatory steps, and eligibility criteria for matters I needed to work on with the DRC government and partners. The information existed somewhere, scattered across documents that were not searchable.

I wanted two things: the ability to interrogate large knowledge bases like legal documentation, and to understand how to build a system like this from the inside.

I looked at a few existing open source RAG projects first. None among the ones I found focused on the kind of corpus I had in mind. And even if one had, adopting it would have skipped the fun part - learning what goes into architecting and building a retrieval system, not just running one.

I also looked at NotebookLM. It has evolved to support larger uploads, multi-document handling, and reasonably grounded answers. Still, as of now it's a standalone product. Limited control. No extension points. No way to evaluate what's actually happening inside retrieval. I wanted infrastructure I could extend, evaluate, compose, and deploy on my own terms.

My first mental model of retrieval was something close to advanced search. Search the whole corpus, return anything matching a phrase. What I actually wanted was different. Say I'm looking into a fintech license in the DRC. A keyword search gets me documents that mention "fintech license." What I wanted was to ask what the requirements actually are, what the eligibility criteria look like, and get an answer assembled from the corpus, not a list of places where the words appear.

That gap between finding a mention and getting an answer is most of what this part is about.

I have multiple use cases that need this. A legal assistant for DRC regulatory documents is the one that pushed me to start, with the corpus itself coming together separately. Others are a domain-specific knowledge base for enterprise systems, and an agentic component I can drop into other products.

I am building it with coding agents: Codex as the lead implementer and Claude Code as the lead reviewer. I plan the phases, make the architectural decisions, validate the output, and prepare the corpus. This is the first of several build log entries.

What this part covers:

Corpus Intake: feeding and identifying source documents and preventing duplicate ingestion
Extraction & Chunking: converting documents into clean page-level text and splitting into traceable chunks
Indexing: turning chunks into semantic and lexical indexes for retrieval
Query & Retrieval: taking a question, searching the indexes, merging and ranking the best evidence
Answer Generation: producing a grounded answer only from validated evidence, with citation checks and fallback handling
Storage & Evaluation: persisting query traces and measuring retrieval quality

Corpus intake

Documents enter the system as identified, hashed records before any text is read.

Identity is the primary key for each document being ingested. Without it at intake, the same document can be ingested multiple times under different file names, silently creating duplicate vectors that corrupt retrieval scores.

The runtime

Five services run through Docker Compose: PostgreSQL, Qdrant, Neo4j, the API, and the frontend. My local setup is Windows 11 with Ubuntu on WSL. On Windows/WSL, host-port conflicts are constant. I configured host ports via .env while keeping container-internal ports fixed, so service-to-service communication stays predictable and the host routes around whatever else is running.

docker compose --env-file .env -f docker/docker-compose.yml --profile local up -d --build --wait
docker compose --env-file .env -f docker/docker-compose.yml --profile local ps
curl http://localhost:8000/health

A fragile runtime is expensive to debug. Port conflicts surfaced as retrieval failures multiple sessions later - no obvious connection. Getting this stable first saved time downstream.

Deduplication

The deduplication spec: SHA-256 over file content - not the file name. File names are unreliable. Content hashes are stable across renames and re-downloads. Just in case, I added a --force flag to give the operator an explicit escape hatch for re-ingestion.

def is_duplicate_source(store: SQLiteMetadataStore, source_path: Path) -> bool:
    source_sha256 = sha256_file(source_path)
    return store.get_document_by_source_sha256(source_sha256) is not None

One hash check prevents a whole class of bugs.
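The sha256_file helper called above is not shown in this post. A minimal sketch of what such a helper could look like, assuming it streams the file in blocks (the block_size parameter is an illustrative choice, not taken from the project):

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file's bytes, read in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size blocks so a large PDF never sits fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Because only the bytes are hashed, a renamed or re-downloaded copy of the same file yields the same digest.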

Extraction & chunking

The aim: turn documents into page records, then provenance-chained chunks, then feed into the embedder.

Extraction records

I reused pagewise-pdf-extractor, a separate library I created for this project and later open-sourced, to handle PDF parsing. I then wrote a thin adapter that calls the extractor, converts the output into internal records, and persists them.

The first corpus: DRC legal documentation in French, with character sets and formatting that generic parsers handle badly.

Every PDF produces three kinds of records: Document (source identity, SHA-256 hash), ExtractionRun (extractor version, config, timestamp), and per-page Page records - including failed pages, stored rather than dropped. Storing failures means extraction problems are visible and recoverable.

Chunking

The first spec for chunking was raw text splitting. That was wrong. I pushed back after seeing the first outputs. Raw splitting has no memory of where a chunk came from. When a citation is wrong, you want to trace exactly which document, extraction run, page, position, and chunker config produced it.

I redesigned it as a recursive markdown/text chunker rather than a naive fixed-size split, so chunks respect document structure while still enforcing size constraints. The alternatives both fell short: sentence splitting was fragile on legal text, because DRC regulatory documents have long, structured sentences that don't map cleanly to semantic boundaries, and fixed-size splitting ignores sentence structure entirely.
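To make the recursive idea concrete, here is a minimal sketch, not the project's chunker. It assumes a fixed separator hierarchy (blank line, newline, space) and a character budget, and it leaves out overlap and character offsets. The function name and parameters are illustrative.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: list[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into the same chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                # Still too large: retry with the finer separators only.
                chunks.extend(recursive_split(piece, max_chars, separators[index + 1:]))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard character split.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A short paragraph that fits the budget stays whole, and only an oversized paragraph is broken down further, which is the property a plain fixed-size split cannot offer.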

Chunk IDs are generated from stable ingredients:

chunk_id = hash(document_id, extraction_run_id, page_number, chunk_position, char_start, char_end, content_hash)

That makes re-indexing safe, citation display stable, and regression debugging tractable. Overlap is validated at config time. Invalid settings are rejected before any chunk is written.
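A minimal sketch of both ideas, assuming SHA-256 as the hash and a simple overlap rule. The function names, the separator byte, and the exact validation rule are illustrative assumptions. One caveat if the real implementation is Python: the built-in hash() is salted per process for strings, so a stable ID needs a digest like this one.

```python
import hashlib


def make_chunk_id(
    document_id: str,
    extraction_run_id: str,
    page_number: int,
    chunk_position: int,
    char_start: int,
    char_end: int,
    content_hash: str,
) -> str:
    """Derive a deterministic ID from the chunk's provenance fields."""
    parts = (
        document_id, extraction_run_id, page_number,
        chunk_position, char_start, char_end, content_hash,
    )
    # Join with a unit separator so ("ab", "c") and ("a", "bc") cannot collide.
    payload = "\x1f".join(str(part) for part in parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate_chunk_config(chunk_size: int, overlap: int) -> None:
    """Reject settings that cannot produce forward progress through the text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
```

The same inputs always produce the same ID, and changing any single ingredient, such as the chunk position, produces a different one.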

Indexing

Chunks become two parallel indexes: one for semantic similarity (dense retrieval), one for exact-term matching (sparse retrieval). Both written in the same ingestion pass.

A single dense vector index answers "what's semantically close?" It can't reliably answer "does this exact phrase appear?" Legal text requires both: dense retrieval for context questions, sparse retrieval for article numbers, decree identifiers, and institution codes. Building both in one pass avoids a separate sync problem.

Embeddings

The first version tied embedding logic directly to OpenAI. I spec'd the ability to switch to a local model for development and to compare providers later without rewriting the pipeline. Providers now sit behind a configurable adapter interface, with OpenAI and Ollama implemented for now. Both return vectors with provider, model, and dimension metadata attached.

embedder:
  openai:
    model: text-embedding-3-small
    dimension: 1536
    api_key_env: OPENAI_API_KEY
  ollama:
    model: nomic-embed-text
    dimension: 768
    endpoint: http://localhost:11434

For the first baseline I used text-embedding-3-small, with nomic-embed-text via Ollama as the local development alternative. The multilingual performance on French was the primary driver, alongside wanting the best retrieval quality without unnecessary model complexity.

The dimension difference is an operational constraint, not a toggle. Switching embedding providers means rebuilding the index from scratch. Existing vectors are incompatible with a new dimension. I required that both the model and dimension that produced each vector be stored to inform future migrations.
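A minimal sketch of an adapter interface that carries this metadata, plus a guard that refuses to mix vectors from different models. All class and function names here are hypothetical, and FakeEmbedder is a deterministic stand-in rather than a real OpenAI or Ollama client.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class EmbeddedVector:
    values: tuple[float, ...]
    provider: str
    model: str
    dimension: int


class Embedder(Protocol):
    provider: str
    model: str
    dimension: int

    def embed(self, texts: Sequence[str]) -> list[EmbeddedVector]: ...


class FakeEmbedder:
    """Stand-in provider for local tests; produces constant-valued vectors."""

    def __init__(self, provider: str, model: str, dimension: int) -> None:
        self.provider, self.model, self.dimension = provider, model, dimension

    def embed(self, texts: Sequence[str]) -> list[EmbeddedVector]:
        return [
            EmbeddedVector(
                values=tuple(float(len(text)) for _ in range(self.dimension)),
                provider=self.provider,
                model=self.model,
                dimension=self.dimension,
            )
            for text in texts
        ]


def assert_index_compatible(index_model: str, index_dimension: int, embedder: Embedder) -> None:
    """Fail fast when the stored index was built by a different model or dimension."""
    if (index_model, index_dimension) != (embedder.model, embedder.dimension):
        raise ValueError(
            f"index built with {index_model}/{index_dimension}, embedder is "
            f"{embedder.model}/{embedder.dimension}: rebuild the index"
        )
```

Running the guard at startup turns a silent dimension mismatch into an explicit rebuild decision.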

One gap this eval doesn't close in part 1: the embedding model has not been compared against an alternative. The smoke results confirm the pipeline works with text-embedding-3-small. They say nothing about retrieval quality, that is, whether this model surfaces the best possible evidence for the given text.

The lexical index

Legal documents are full of exact identifiers: article numbers, decree codes, institution names. Semantic similarity finds what's nearby in meaning; it doesn't reliably surface "Article 12 of Decree 18/027" when that's the exact string the user typed.

That's where BM25 comes in. BM25 (Best Matching 25) is the standard probabilistic term-frequency ranking algorithm. It scores term frequency against corpus-wide rarity, normalizes for document length, and requires no training or inference.
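For readers who have not seen the scoring function, here is a minimal BM25 sketch over pre-tokenized documents. It is not the project's index. It assumes the non-negative IDF variant (the form with 1 added inside the logarithm) and the common defaults k1=1.5 and b=0.75. Tokenization is left to the caller.

```python
import math
from collections import Counter


def bm25_scores(
    query_terms: list[str],
    documents: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Score every tokenized document against the query terms."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a larger weight than terms found everywhere.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores
```

With the query terms "article" and "12", a document containing both scores above one containing only "article", and a document containing neither scores zero.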

BM25 runs as a file-backed index when Qdrant's sparse slot is unavailable, and as a Qdrant sparse vector when the collection supports it. The retrieval layer doesn't care which and offers the same interface either way.

Query & retrieval

A user question passes through normalization, two parallel searches, rank fusion, and a cross-encoder reranker before a single candidate is considered for an answer.

Why dense search wasn't enough

Dense retrieval alone surfaces semantically similar content. It doesn't reliably rank an exact article identifier above everything else when that's precisely what was asked. Hybrid retrieval solves the coverage problem. Cross-encoder reranking solves the ranking problem. The engine requires both, in that order.

The hybrid retrieval pipeline:

query → normalize → dense search (Qdrant) + lexical search (Qdrant sparse or BM25)
      → reciprocal rank fusion → BGE reranker → traceable candidates

I chose Reciprocal Rank Fusion (RRF) as the fusion strategy over score-based fusion. Score-based fusion requires calibrated, comparable scores across backends. Qdrant cosine similarity scores and BM25 scores are not on the same scale. RRF operates on ranks only, which is backend-agnostic and stable.
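A minimal sketch of RRF over ranked lists of chunk IDs. The constant k=60 is the value commonly used in the literature and is an assumption here, as are the function name and the tie-break on chunk ID.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c2", "c12", "c7"]    # hypothetical dense-search order
lexical = ["c12", "c9"]        # hypothetical lexical-search order
fused = reciprocal_rank_fusion([dense, lexical])
```

Only ranks enter the formula, so the raw cosine and BM25 scores never need to be put on a common scale. A chunk found by both backends, like c12 here, accumulates two contributions and rises to the top.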

BAAI/bge-reranker-base runs last. In the smoke test, it ranked the Article 12 chunk above the Article 2 chunk for an Article 12 query. Without the reranker, exact-match lexical hits could be buried by semantic results.

The query planner is deliberately minimal: normalization and strategy recording only. I considered query decomposition and multi-query expansion, then deferred both. They need eval evidence before they go near production, especially with exact legal identifiers that must keep lexical priority.

Every candidate carries a trace: which backend found it, what its rank was before and after fusion, whether reranking moved it. Basic yet useful observability before answer generation.
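One way such a per-candidate trace could be shaped, as a sketch only. The field names are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CandidateTrace:
    chunk_id: str
    # Rank of this chunk in each backend that returned it, e.g. {"dense": 2}.
    backend_ranks: dict[str, int] = field(default_factory=dict)
    fused_rank: Optional[int] = None      # position after rank fusion
    reranked_rank: Optional[int] = None   # position after the reranker

    @property
    def moved_by_reranker(self) -> bool:
        """True only when both ranks are known and they differ."""
        return (
            self.fused_rank is not None
            and self.reranked_rank is not None
            and self.fused_rank != self.reranked_rank
        )
```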

Qdrant: the vector database backing both indexes

Qdrant stores and queries the dense and sparse vectors produced during ingestion. Both retrieval paths hit it at query time.

python -m ingestion.ingest --source "path\to\pdf-or-folder" --force

Two bugs were identified during live smoke tests.

Bug 1: SQLite contained a document record before the Qdrant collection had been created. The CLI attempted to delete old vectors, received a “collection not found” response from Qdrant, and ingestion failed. Fix: deleting vectors for a document is a no-op when the collection is absent. A missing collection should not be treated as a connection failure.

Bug 2: During a smoke run with a different metadata database, the forced ingestion path would not detect existing vectors for the same document in Qdrant. It skipped deletion and upserted the vectors again, creating duplicates. Fix: on forced ingestion, always delete by deterministic doc_id before upserting, regardless of records in the current metadata database.
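Both fixes can be illustrated with a small in-memory stand-in for the vector backend. This is a sketch under stated assumptions: InMemoryVectorStore and force_ingest are hypothetical names, and none of this is the Qdrant client API.

```python
class InMemoryVectorStore:
    """Hypothetical stand-in: collection name -> {point_id: owning doc_id}."""

    def __init__(self) -> None:
        self.collections: dict[str, dict[str, str]] = {}

    def delete_document(self, collection: str, doc_id: str) -> int:
        points = self.collections.get(collection)
        if points is None:
            # Bug 1 fix: a missing collection means nothing to delete, not an error.
            return 0
        stale = [point_id for point_id, owner in points.items() if owner == doc_id]
        for point_id in stale:
            del points[point_id]
        return len(stale)

    def upsert(self, collection: str, doc_id: str, point_ids: list[str]) -> None:
        points = self.collections.setdefault(collection, {})
        for point_id in point_ids:
            points[point_id] = doc_id


def force_ingest(
    store: InMemoryVectorStore, collection: str, doc_id: str, point_ids: list[str]
) -> None:
    # Bug 2 fix: always delete by the deterministic doc_id first, without
    # consulting the metadata database, then write the new vectors.
    store.delete_document(collection, doc_id)
    store.upsert(collection, doc_id, point_ids)
```

Deleting by document ID before every forced upsert keeps the vector store correct even when the metadata database has no record of an earlier run.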

A third finding: enabling Qdrant sparse search on an existing dense-only collection is a schema migration, not a config task. A collection created for dense vectors does not include a sparse vector slot. The migration preserves existing data but requires confirmation before running.

Smoke run output:

python -m ingestion.ingest --source "working\smoke-folder-source" --metadata-db working\smoke-folder-metadata.db --output-dir working\smoke-folder-extracted --force

ingestion summary: total=2 ingested=2 skipped=0 failed=0 chunks=13

QDRANT_COLLECTION_PRESENT=True
DOC=agentic-rag-levels.pdf;METADATA_PRESENT=True;VECTOR_COUNT=7
DOC=build-rag-from-scratch.pdf;METADATA_PRESENT=True;VECTOR_COUNT=6

Answer generation

A model generates from whatever it receives. Left unconstrained, it will improvise from weak context, cite chunks it wasn't given, and produce confidently wrong answers, i.e. hallucinations. This section is about the boundary between retrieval and generation: what gets through, what gets rejected, and how failure is handled usefully.

Retrieval finds candidates. Generation has to map each answer back to specific retrieved chunks. That's what separates a RAG system from a retrieval-assisted hallucination machine.

The evidence boundary

Generation is a contract. The model shouldn't see raw top-k results and improvise from whatever looks plausible. It gets a validated evidence set, answers only from that, cites the exact chunk IDs it used, and fails in a useful way when the evidence is weak.

While designing this layer I looked at a pattern where a side agent retrieves and filters candidate memories, verifies relevance before injecting them into the main agent's context. I dismissed that architecture, as it seemed to solve a different problem. But I retained one thing: after retrieval and reranking, query similarity is not the same as answer relevance. This means a chunk can rank highly in retrieval and still not support the answer.

I put that insight into the spec for a bounded evidence validator (a pipeline stage, not an agent loop). It filters irrelevant chunks, flags weak or contradictory context, records which chunk IDs were supported or rejected, and allows at most one corrective retrieval pass. It sits between the reranker and the LLM.

Two failure modes handled:

Evidence gaps (retrieval failures): no retrieved context, retrieved-but-unsupported context, ambiguous questions, contradictory evidence - caught before the LLM sees anything.
Citation hallucination (generation failures): the model claims a chunk ID that wasn't in the supplied context - caught after generation and rejected.

Citation validation

If the model cites chunk IDs that weren't in the supplied context, the answer is rejected. Valid citations map back to document, page, section, chunk ID, and a compact preview.
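The rejection rule reduces to a set-membership check. A minimal sketch, assuming citations arrive as a list of chunk ID strings (the names CitationCheck and validate_citations are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationCheck:
    accepted: bool
    unknown_ids: list[str]


def validate_citations(cited_ids: list[str], supplied_ids: list[str]) -> CitationCheck:
    """Reject the answer if it cites any chunk that was not in the supplied context."""
    supplied = set(supplied_ids)
    unknown = [chunk_id for chunk_id in cited_ids if chunk_id not in supplied]
    return CitationCheck(accepted=not unknown, unknown_ids=unknown)
```

Returning the unknown IDs alongside the verdict gives the trace something concrete to record when an answer is rejected.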

When related evidence exists but doesn't support the exact claim, the response says what's unsupported and summarizes the nearest supported facts with citations. A bare "no evidence" proved unhelpful in practice.

Multilingual answering

The answer language is an explicit contract, separate from retrieval filters. A French question can still retrieve English-language source documents unless the user explicitly filters by document language. Auto-detection runs when confidence is high enough, and can be overridden per request. My defaults currently cover English, French, and Swahili.

This matters for legal text specifically. DRC regulatory documents exist in French. Questions about them may come in French or English or any one of DRC's national languages. If language choice acts as a retrieval filter, it silently narrows the corpus. A French-speaking user asking about an English-language law should get the right answer, in French. Language is a presentation and generation behavior. It is not a retrieval filter.

The golden query evaluation pack has a dedicated slice for this: legal-french-question-english-source. A French-language question against an English-language corpus, scored for both retrieval quality and answer language correctness.

Storage & evaluation

Every query produces a compact trace stored in PostgreSQL. Every retrieval change is measured against golden queries before being called an improvement.

Traces

Traces are designed to make every query debuggable without storing what shouldn't be stored. PostgreSQL stores query history, retrieved chunk metadata, responses, citations, latency, and feedback. Retrieved evidence is recorded as compact identifiers and ranks only: no raw document text, no provider payloads, no API keys. Enough to diagnose any retrieval failure. Nothing that becomes a privacy liability.

The eval harness

An eval harness is a repeatable test suite that runs a fixed set of queries against the system and measures whether retrieval, citation, and answer quality meet defined expectations.

I required an evaluation harness before adding anything beyond basic retrieval. The easiest trap in RAG work is adding intelligence before you can tell whether the last improvement helped. Codex implemented the runner; I wrote the golden queries.

Each golden query defines expected document IDs, page numbers, chunk IDs, and whether insufficient evidence is the correct response:

query: "What does Article 12 require?"
filters:
  jurisdiction: example
expected:
  doc_ids: ["example-law"]
  pages: [3]
  chunk_ids: ["example-law-p3-c1"]
  insufficient_evidence: false

The runner reports recall@K, MRR, citation hit rate, insufficient-evidence correctness, answer-language correctness, and stage-level failure categories.

Before running any eval, I defined exactly what each metric measures.

Recall and MRR are scored only within the eval variant's top-K, not across all results.
Negative controls without expected support don't dilute ranking metrics.
Partial citation coverage counts as a failure only when retrieval found all expected support.
Document/page pairs are positional, not accidental cross-products.

These rules matter the first time you debug a regression and need to know exactly which metric failed at which stage.
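The first two rules can be sketched as follows. This is not the project's runner: the function names are illustrative, and representing a negative control as an empty expected set is an assumption.

```python
from typing import Optional


def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> Optional[float]:
    """Fraction of expected IDs found in the top-K; None for negative controls."""
    if not expected:
        return None  # no expected support: excluded from ranking metrics
    top_k = set(retrieved[:k])
    return sum(1 for chunk_id in expected if chunk_id in top_k) / len(expected)


def reciprocal_rank(retrieved: list[str], expected: set[str], k: int) -> Optional[float]:
    """1 / rank of the first expected ID within the top-K, else 0.0."""
    if not expected:
        return None
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in expected:
            return 1.0 / rank
    return 0.0


def mean_defined(values: list[Optional[float]]) -> Optional[float]:
    """Average only the queries where the metric is defined."""
    defined = [value for value in values if value is not None]
    return sum(defined) / len(defined) if defined else None
```

Averaging per-query reciprocal ranks with mean_defined gives MRR, and a hit outside the top-K counts as a miss rather than a low rank.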

The first live eval used the smoke corpus with 2 PDFs:

python -m pytest tests/integration/test_smoke_eval.py

Smoke corpus:
- 2 PDFs
- 13 completed pages
- 13 reconstructed chunks

Metrics:
recall@K:                          1.0
MRR:                               1.0
citation hit rate:                 1.0
insufficient-evidence correctness: 1.0
stage-level failure count:         0

These are perfect scores across the board on a 2-document corpus. But the system is far from production-ready. For now, the pipeline is wired correctly and the harness works against real stored data and can catch regressions. That's the milestone.

This part achieved document intake, extraction, chunking, embedding, hybrid search, reranking, citation validation, and evaluation storage.

Up next: connecting the backend to a chat interface.
