Domain-Neutral RAG Architecture

I initially set out to build a RAG system for a legal corpus and repurpose it later for other domains. In this phase, the goal became a reusable architecture: a single implementation with multiple domain packs.

The demand comes both from my own projects and from different industries. Companies want an intelligence layer sitting on top of everything they already have: their history, their systems, and their accumulated knowledge, accessible and self-adapting instead of locked away in documents nobody reads. Another project around an enterprise operating system had the same need.

This made me rethink the entire architecture. Instead of keeping everything wired up inside the RAG engine, I needed a multi-layer system. RAG becomes the knowledge component inside a larger agentic system, sitting next to tools and orchestration.

The legal corpus remains the first concrete use case, launched as a domain pack.

Almost everything in the engine was still carrying legal-corpus assumptions at that point: the semantic logic, the query transformation, and the absence of any real domain concept beyond one baked-in standard domain. The new direction: domain packs, as a concept, instead of one domain hardcoded everywhere.

By the end of this phase, RAG had become one component inside the AI system. The system now expands beyond retrieval: behavior shaping, multi-step task orchestration, and a product-facing layer that handles users, sessions, and policy, independent of any single model provider or deployment choice.

The five layers

This shift produced five layers, each playing a different role.

Layer 1 is the replaceable infrastructure layer. Platform-agnostic, not tied to any single model or feature, cutting across everything else operationally.
Layer 2 is the RAG engine itself, where the retrieval foundations live. It owns retrieval quality, provenance, citation validation, sufficient-evidence checks, and graph reference expansion.
Layer 3 owns behavior shaping, outside source-grounded retrieval. Preference datasets, prompt variants, model adaptation, behavioral eval. Not built yet.
Layer 4 owns multi-step task orchestration and product workflows, the agentic part. It can call the RAG API repeatedly but never imports RAG internals directly.
Layer 5 owns user-facing product access. Authentication, channel adapters, sessions, user mapping, tenant and product routing, product-specific policy. It can call Layer 4 for workflows or go straight to Layer 2 for simple cited Q&A.

This part covers what changed structurally to achieve that separation.

The new structure separates concerns before domain-reference storage:

AI-System/
  rag-engine/          ← retrieval, grounding, citations, evals, storage
  rag-chainlit/        ← one replaceable client
  rag-platform/        ← deployment composition
  behavioral-shaping/
  agentic-orchestration/

The engine owns retrieval, grounding, citations, evals, and storage. Chainlit is a client. Upper layers call Layer 2 through versioned HTTP endpoints, not internal imports. That boundary has to be enforced structurally.
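One way to enforce that boundary structurally is a check that fails the build when upper-layer code imports engine modules. The sketch below is illustrative only: the package name `rag_engine` and the function `find_forbidden_imports` are assumptions, not names from the actual repositories.

```python
import ast

# Hypothetical Python package name for the Layer 2 engine (an assumption).
FORBIDDEN_ROOTS = {"rag_engine"}


def find_forbidden_imports(source: str) -> list[str]:
    """Return the engine modules that a piece of upper-layer source imports."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        # Compare only the top-level package of each imported module.
        hits.extend(n for n in names if n.split(".")[0] in FORBIDDEN_ROOTS)
    return hits


# Example inputs: one compliant upper-layer module, one that breaks the boundary.
orchestrator_ok = "import json\nimport urllib.request\n"
orchestrator_bad = "from rag_engine.retrieval import fuse\nimport rag_engine.storage\n"
```

Run against every file in the upper-layer repositories, a non-empty result would fail CI, so the HTTP-only rule does not depend on reviewers catching a stray import.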

Here is what this part covers at the implementation level:

API Access: versioned routes, Bearer auth, rate limits
Semantic Cache: disabled by default; strict hit conditions for legal retrieval
Query Transformer: passthrough boundary; transformation strategies deferred pending eval evidence
Parent-Child Chunking: small chunks retrieve, large windows generate; enrichment separated from source
Metadata Filtering: scoped retrieval with consistent semantics across all backends
Feedback Learning: bounded signals; no direct mutation path to source truth
Domain Packs: domain-specific logic in sibling repositories; engine owns only the contract
Reference Storage: deterministic references in PostgreSQL and Neo4j; expansion post-fusion pre-reranker

API access

Goal: decouple the engine from Chainlit as the assumed only caller by giving it a versioned, authenticated, rate-limited public API contract.

Routes moved under /v1. I chose versioned routes over unversioned ones to allow breaking changes without forcing all consumers to update simultaneously. The public contract gained Bearer modes and per-consumer rate limits, while local mode stayed the default so existing Chainlit behavior was unchanged. Per-consumer rate limits went in before the API was opened to anything beyond Chainlit.

I rejected internal library calls as an alternative. Upper layers must call Layer 2 through the versioned HTTP contract. Not imports. Not direct function calls.
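To make the Bearer and per-consumer rate-limit idea concrete, here is a minimal sketch using only the standard library. The token table, the consumer names, and the sliding-window approach are all assumptions for illustration; the engine's actual auth modes and limiter are not shown in this post.

```python
import time
from collections import defaultdict, deque
from typing import Optional

# Hypothetical token table; a real deployment would load this from secure config.
TOKENS = {"token-chainlit": "chainlit", "token-orchestrator": "orchestrator"}


def consumer_from_header(authorization: Optional[str]) -> Optional[str]:
    """Map an 'Authorization: Bearer <token>' header value to a consumer name."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return None
    return TOKENS.get(token.strip())


class SlidingWindowLimiter:
    """Allow at most `limit` requests per consumer within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # consumer name -> request timestamps

    def allow(self, consumer: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[consumer]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Keying the limiter by consumer rather than by IP means one noisy upper layer cannot exhaust the budget of another.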

Semantic cache

Goal: serve repeated identical queries from stored answers, with the cache disabled by default because legal retrieval has strict freshness requirements that make premature caching dangerous.

Repeated identical queries against a stable corpus waste tokens and latency. Caching is the obvious fix, until the corpus updates, a document version changes, or a user gets a stale answer to a legal question about a text that has since been amended.

The cache contract has strict hit conditions: cited sources, document version, retrieval config, model identity, language, and access scope must all match. Stale, expired, and negatively-rated entries are treated as misses. One negative feedback event quarantines the exact artifact.
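The hit conditions above can be sketched as an exact-match key plus a few miss rules. Everything named here (`CacheKey`, `CacheEntry`, `StrictCache`, and the field names) is hypothetical, and including the query text in the key is my assumption based on "repeated identical queries".

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    query: str
    cited_source_ids: tuple
    document_versions: tuple
    retrieval_config_hash: str
    model_id: str
    language: str
    access_scope: str

    def digest(self) -> str:
        # Every field participates, so any mismatch yields a different key.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class CacheEntry:
    answer: str
    expires_at: float
    stale: bool = False
    negative_feedback: bool = False


class StrictCache:
    def __init__(self, enabled: bool = False) -> None:
        self.enabled = enabled  # disabled by default, as a policy
        self._entries = {}

    def put(self, key: CacheKey, entry: CacheEntry) -> None:
        if self.enabled:
            self._entries[key.digest()] = entry

    def get(self, key: CacheKey, now: float) -> Optional[str]:
        if not self.enabled:
            return None
        entry = self._entries.get(key.digest())
        # Stale, expired, and negatively-rated entries are treated as misses.
        if entry is None or entry.stale or entry.negative_feedback or now >= entry.expires_at:
            return None
        return entry.answer

    def quarantine(self, key: CacheKey) -> None:
        """One negative feedback event quarantines this exact artifact."""
        entry = self._entries.get(key.digest())
        if entry is not None:
            entry.negative_feedback = True
```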

Default is disabled, not as a placeholder but as a deliberate policy for legal retrieval.

I rejected a simple embedding-similarity cache (matching by semantic proximity rather than exact content identity). It cannot distinguish between two queries that are semantically close but legally distinct. Two questions about Article 12 of different decrees could match in embedding space but require completely different evidence sets.

Query transformer

Goal: reserve space for query rewriting strategies without enabling any of them until eval evidence justifies it. Part 1 deferred decomposition and multi-query expansion because exact legal identifiers needed lexical priority. This phase makes that rule explicit in the query flow.

Query transformation can improve recall on vague or broad questions. It can also destroy lexical priority for exact legal identifiers, the thing the engine most needs to get right. The QueryTransformerBase boundary exists so the option is available. Passthrough is the default so nothing runs without proof it helps.

I considered three strategies:

HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, retrieve against that
Multi-query: generate query variants, merge results
Decomposition: split compound questions into sub-queries

All three are deferred. Decomposed sub-queries are particularly risky here, because they can lose the exact article identifier that makes a legal query precise. A query for "Article 12 of Decree 18/027 on RCCM registration" split into sub-queries may retrieve Article 12 content without the decree context, or retrieve Decree 18/027 content without the article specificity.

One constraint enforced by design: a transformed query can never replace the original in history, feedback, or citation grounding. The original question is the authoritative record.
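A minimal sketch of the passthrough boundary and the "original stays authoritative" constraint. Only the name `QueryTransformerBase` comes from the engine; the `transform` method, `TransformedQuery`, and `PassthroughTransformer` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformedQuery:
    # The original question is the authoritative record for history,
    # feedback, and citation grounding. Frozen so it cannot be overwritten.
    original: str
    # What retrieval actually runs; may hold variants once a strategy is enabled.
    retrieval_queries: tuple
    strategy: str


class QueryTransformerBase(ABC):
    @abstractmethod
    def transform(self, query: str) -> TransformedQuery:
        """Produce retrieval queries without discarding the original."""


class PassthroughTransformer(QueryTransformerBase):
    """Default strategy: retrieval sees exactly what the user asked."""

    def transform(self, query: str) -> TransformedQuery:
        return TransformedQuery(
            original=query,
            retrieval_queries=(query,),
            strategy="passthrough",
        )
```

With this shape, HyDE, multi-query, or decomposition would each be another subclass that fills `retrieval_queries` differently, while downstream code keeps reading `original`.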

Parent-child chunking

Goal: use small child chunks as the retrieval unit and larger parent windows as the generation context, with generated enrichment stored separately and never allowed to become citable evidence.

Retrieval precision and generation quality pull in opposite directions. Small chunks retrieve precisely; large chunks give the model enough context to answer coherently. Parent-child chunking resolves the retrieval-generation tension without merging the two concerns.

Simply increasing chunk size improves generation context but degrades retrieval precision. Parent-child preserves both.

In Part 1 we established the citation rule: the answer can only cite chunks that retrieval actually returned. Parent-child chunking keeps that rule intact while giving the model more context. The retrieved child chunk remains the citable evidence, but the model can also read the larger parent window around it to understand the passage better.

Traces link every matched child chunk to its supplied parent window, so the provenance chain stays intact from retrieval through generation.
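The child-to-parent link can be sketched in a few lines. This is an illustration under simple assumptions (sentence-level children, fixed-size parent windows, hypothetical names such as `ChildChunk` and `build_generation_context`), not the engine's actual chunker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    chunk_id: str
    parent_id: str
    text: str


def build_parent_child(doc_id: str, sentences: list, children_per_parent: int = 3):
    """Group sentences into parent windows; each sentence becomes a child chunk."""
    parents = {}
    children = []
    for start in range(0, len(sentences), children_per_parent):
        parent_id = f"{doc_id}-p{start // children_per_parent}"
        window = sentences[start:start + children_per_parent]
        parents[parent_id] = " ".join(window)
        for offset, sentence in enumerate(window):
            children.append(ChildChunk(f"{doc_id}-c{start + offset}", parent_id, sentence))
    return parents, children


def build_generation_context(matched: list, parents: dict) -> list:
    """Return trace rows: the child is the citable evidence, the parent is context only."""
    return [
        {
            "cited_chunk_id": child.chunk_id,
            "cited_text": child.text,
            "parent_id": child.parent_id,
            "context_window": parents[child.parent_id],
        }
        for child in matched
    ]
```

Only children are indexed for retrieval; the parent text is looked up after a match, so it widens what the model reads without widening what it can cite.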

Metadata filtering

Goal: scope retrieval by jurisdiction, document type, date range, or source collection, with consistent filter semantics enforced across all retrieval backends.

A query about a 2023 DRC decree should not surface a 2015 ordinance as a top result. Metadata filters let the user or operator scope retrieval without changing the query itself.

The challenge was that Qdrant dense, Qdrant sparse, and BM25 handle missing metadata differently. Without normalization, the same filter produces different results depending on which backend answers the query. That inconsistency silently excludes documents, and you never know what the user isn't seeing.

I defined allowed keys, operators, and missing-metadata semantics explicitly, then normalized them before they reach each backend. Backend parity is enforced at the filter layer, not at the query level.
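A sketch of what normalizing filters before they reach a backend could look like. The key names, the operator set, the `match_if_missing` flag, and the use of a plain `year` integer in place of a date range are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical allow-lists; the real engine's keys and operators may differ.
ALLOWED_KEYS = {"jurisdiction", "document_type", "year", "source_collection"}
ALLOWED_OPS = {"eq", "in", "gte", "lte"}


@dataclass(frozen=True)
class Condition:
    key: str
    op: str
    value: Any
    match_if_missing: bool = False  # explicit missing-metadata semantics


def normalize(raw: list) -> list:
    """Validate raw filter dicts and turn them into backend-neutral conditions."""
    conditions = []
    for item in raw:
        key, op = item["key"], item.get("op", "eq")
        if key not in ALLOWED_KEYS:
            raise ValueError(f"filter key not allowed: {key}")
        if op not in ALLOWED_OPS:
            raise ValueError(f"filter operator not allowed: {op}")
        conditions.append(
            Condition(key, op, item["value"], bool(item.get("match_if_missing", False)))
        )
    return conditions


def matches(metadata: dict, conditions: list) -> bool:
    """Reference semantics that every backend translation must reproduce."""
    for c in conditions:
        actual = metadata.get(c.key)
        if actual is None:
            if not c.match_if_missing:
                return False
            continue
        if c.op == "eq" and actual != c.value:
            return False
        if c.op == "in" and actual not in c.value:
            return False
        if c.op == "gte" and actual < c.value:
            return False
        if c.op == "lte" and actual > c.value:
            return False
    return True
```

A reference `matches` function like this also gives parity tests something to compare each backend's results against.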

Access-scope enforcement runs here too. A filter that silently excludes a document the user isn't authorized to see is a security control, not a relevance preference. Those two concerns share the same enforcement point.

Feedback learning

Goal: let feedback flow into bounded, safe signals without creating a direct mutation path to source truth or system behavior.

This phase builds on the feedback interface introduced earlier. A thumbs-up or thumbs-down should not rewrite source truth, change prompts, or alter retrieval defaults by itself. It should create a signal the system can analyze, test, and promote only after evidence supports the change.

Feedback is the only live signal about whether answers are actually useful. This is the first glimpse into a self-learning system.

The safe action policy includes:

cache invalidation
bounded fresh-retrieval policies
eval candidates
failure clustering
operator review
shadow evals

Everything else (mutating source documents, graph facts, prompts, model defaults, retrieval defaults, golden query packs) requires an explicit eval gate. Raw feedback is a weak signal. It creates candidates. Eval evidence and promotion gates decide defaults.
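The policy above amounts to an allow-list plus a gate. A minimal sketch, with hypothetical action identifiers derived from the two lists:

```python
# Actions raw feedback may trigger directly (identifiers are assumptions).
SAFE_ACTIONS = frozenset({
    "cache_invalidation",
    "bounded_fresh_retrieval",
    "eval_candidate",
    "failure_clustering",
    "operator_review",
    "shadow_eval",
})

# Actions that need an explicit eval gate before they can run.
GATED_ACTIONS = frozenset({
    "mutate_source_document",
    "mutate_graph_fact",
    "change_prompt",
    "change_model_default",
    "change_retrieval_default",
    "change_golden_query_pack",
})


def is_allowed(action: str, eval_gate_passed: bool = False) -> bool:
    """Feedback alone unlocks safe actions; gated ones need eval evidence."""
    if action in SAFE_ACTIONS:
        return True
    if action in GATED_ACTIONS:
        return eval_gate_passed
    return False  # unknown actions are denied by default
```

Denying unknown actions by default is a choice of this sketch: a newly added action then has to be classified before feedback can trigger it.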

The design keeps feedback and evidence separate: feedback shows what may need improvement; evidence decides what actually changes.

I also added a separate feedback export path. It joins query history with user feedback, then produces local JSONL records for later analysis. The export removes raw prompts, provider payloads, headers, secrets, session keys, full source documents, and uncited chunk text. That gives the upper layers useful feedback data without letting the live engine change itself.
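A sketch of the export join, assuming an allow-list of fields rather than a deny-list, so anything not explicitly named is dropped. The field names (`query_id`, `question`, `rating`, and so on) are hypothetical, and whether the user's question itself is exported is my assumption.

```python
import json

# Hypothetical allow-list: only these fields ever reach the export.
EXPORT_FIELDS = ("query_id", "question", "answer", "rating", "cited_chunk_ids")


def to_export_record(history_row: dict, feedback_row: dict) -> dict:
    """Join one history row with its feedback and keep only allowed fields."""
    merged = {**history_row, **feedback_row}
    return {name: merged[name] for name in EXPORT_FIELDS if name in merged}


def export_jsonl(history: list, feedback: list) -> str:
    """Return one JSON object per line for queries that received feedback."""
    feedback_by_id = {row["query_id"]: row for row in feedback}
    lines = [
        json.dumps(to_export_record(row, feedback_by_id[row["query_id"]]), sort_keys=True)
        for row in history
        if row["query_id"] in feedback_by_id
    ]
    return "\n".join(lines)
```

With an allow-list, a new sensitive column added to query history later stays out of the export unless someone adds it on purpose.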

Domain packs

Goal: keep domain-specific extraction logic (rules, normalizers, manifests, fixtures) in separate sibling repositories so the engine stays reusable across domains.

The first corpus is DRC legal documents. If legal extraction logic is baked into the engine schema, the engine works for DRC law and nothing else. Every new domain would require engine-level changes.

My decision was to extract the domain pack contract before writing the deterministic-reference storage layer. The engine owns DomainReference (domain, reference type, raw text, normalized value, source IDs, span, confidence, metadata). The legal pack owns what "article" or "decree" means.
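The split can be sketched as an engine-owned record plus a pack-owned extractor. The `DomainReference` fields follow the list above, but their types, the `DomainPack` protocol, the `LegalPackSketch` class, and the `article:12` normalization format are assumptions for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass(frozen=True)
class DomainReference:
    """Engine-owned shape: no legal vocabulary appears here."""
    domain: str
    reference_type: str
    raw_text: str
    normalized_value: str
    source_ids: tuple
    span: tuple
    confidence: float
    metadata: dict = field(default_factory=dict)


class DomainPack(Protocol):
    """What the engine expects from any pack (hypothetical interface)."""
    domain: str

    def extract(self, chunk_id: str, text: str) -> list: ...


class LegalPackSketch:
    """Pack-owned logic: only this class knows what an 'article' is."""
    domain = "legal"
    _ARTICLE = re.compile(r"\bArticle\s+(\d+)\b")

    def extract(self, chunk_id: str, text: str) -> list:
        return [
            DomainReference(
                domain=self.domain,
                reference_type="article",
                raw_text=m.group(0),
                normalized_value=f"article:{m.group(1)}",
                source_ids=(chunk_id,),
                span=(m.start(), m.end()),
                confidence=1.0,  # deterministic regex match
            )
            for m in self._ARTICLE.finditer(text)
        ]
```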

AI-System/
  rag-engine/             ← shared contracts, storage, retrieval, eval
  rag-domain-legal/       ← extraction rules, normalizers, manifests, fixtures
  rag-domain-company-x/   ← sample domain pack scaffolded

I scaffolded sample packs to prove the domain pack contract was genuinely domain-agnostic before committing to it.

The alternative was legal-specific columns in PostgreSQL and legal-specific Neo4j labels. Refactoring before the reference-store schema hardened was cheaper than migrating legal-specific tables and labels later.

Reference storage: PostgreSQL and Neo4j

Goal: give the retrieval pipeline a second expansion path (exact identifier lookup) that vector similarity cannot replicate, backed by PostgreSQL generic tables and a Neo4j graph with generic labels.

In Part 1, I handled exact-term retrieval with BM25. This step adds structured reference expansion.

Instead of only matching the phrase "Article 12 of Decree 18/027", I now store extracted domain references and use them to expand retrieval before reranking. BM25 finds the text. Reference storage lets retrieval follow the reference.

PostgreSQL uses generic domain_* table names, no legal-specific columns. Neo4j uses generic labels: Chunk, DomainReferenceEntity, with relationships MENTIONS_REFERENCE, RESOLVES_TO_CHUNK, REFERENCE_EXPANDS_TO. The generic structure is the whole point of the restructure.

Reference expansion inserts after fusion, before reranking:

dense + lexical → fusion → deterministic reference expansion → Neo4j traversal → dedupe → rerank → cited answer

The engine handles forced replacement by deleting old references by doc_id before writing the new ones. It upserts each reference by reference_id, so reruns do not create duplicates. Expansion depth and candidate count stay capped in config, so the reference graph cannot flood retrieval results.
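The replace, upsert, and cap behavior can be sketched with an in-memory stand-in for the reference store. This covers only the deterministic expansion and dedupe steps; the Neo4j traversal and the depth cap are left out, and all names (`InMemoryReferenceStore`, `expand`, `max_candidates`) are hypothetical.

```python
class InMemoryReferenceStore:
    """Stand-in for the PostgreSQL reference tables (illustrative only)."""

    def __init__(self) -> None:
        self._refs = {}  # reference_id -> reference dict

    def replace_document(self, doc_id: str, references: list) -> None:
        # Forced replacement: delete old references for the doc first...
        self._refs = {rid: r for rid, r in self._refs.items() if r["doc_id"] != doc_id}
        # ...then upsert by reference_id so reruns never create duplicates.
        for ref in references:
            self._refs[ref["reference_id"]] = ref

    def chunks_for(self, normalized_value: str) -> list:
        return [r["chunk_id"] for r in self._refs.values()
                if r["normalized_value"] == normalized_value]

    def __len__(self) -> int:
        return len(self._refs)


def expand(fused_chunk_ids: list, query_references: list,
           store: InMemoryReferenceStore, max_candidates: int = 5) -> list:
    """Append reference-resolved chunks after fusion, deduped and capped."""
    seen = set(fused_chunk_ids)
    result = list(fused_chunk_ids)
    added = 0
    for value in query_references:
        for chunk_id in store.chunks_for(value):
            if added >= max_candidates:
                return result  # the cap keeps references from flooding results
            if chunk_id not in seen:
                seen.add(chunk_id)
                result.append(chunk_id)
                added += 1
    return result
```

The expanded list then goes to the reranker, which decides whether the reference-resolved chunks deserve a place in the final context.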

Live smoke caught one thing mocked tests missed: Neo4j does not support nested maps as node properties. Chunk snapshots had to be serialized as JSON strings. That's why infrastructure-touching phases need live smoke.
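A sketch of the kind of flattening that fix implies, run before properties are handed to the driver. The helper name `to_neo4j_properties` and the `_json` key suffix are assumptions; the underlying constraint is that Neo4j properties must be primitive values or homogeneous lists of them.

```python
import json

_PRIMITIVES = (str, int, float, bool)


def to_neo4j_properties(props: dict) -> dict:
    """Serialize values Neo4j cannot store as properties into JSON strings."""
    out = {}
    for key, value in props.items():
        if value is None:
            continue  # null properties are not stored, so skip them
        if isinstance(value, _PRIMITIVES):
            out[key] = value
        elif (
            isinstance(value, list)
            and all(isinstance(v, _PRIMITIVES) for v in value)
            and len({type(v) for v in value}) <= 1
        ):
            out[key] = value  # homogeneous primitive lists are valid as-is
        else:
            # Nested maps (and mixed lists) become JSON strings under a new key.
            out[f"{key}_json"] = json.dumps(value, sort_keys=True)
    return out
```

Reading the node back is then a `json.loads` on the `_json` property, which keeps the snapshot intact without modeling it as graph structure.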

Live smoke result:

{"doc_id":"smoke-967ef16a7478","neo4j_expanded_chunk_ids":["target"],"neo4j_status":"ok","postgres_expanded_chunk_ids":["target"],"postgres_status":"ok","status":"ok"}

This part achieved a reusable RAG pipeline with domain packs, structured references, safer feedback handling, and a query flow that could evolve without weakening evidence control.

Up next, engine controls: graph retrieval, model changes, cost tracking, eval gates, and the first steps toward domain-pack deployment.

Building a RAG Engine — Part 3: Domain-Neutral Architecture
