Building a RAG Engine — Part 4: Engine Controls
The previous part redesigned the entire system into a domain-neutral engine structure with reference storage.
Graph extraction was one of the planned capabilities for the engine: for a start, basic knowledge graphs to feed retrieval with relationships beyond dense and deterministic lookups. Implementing it introduced LLMs into the extraction path, which in turn introduced cost, provider management, evaluation, and replacement concerns.
For graph extraction to work, I needed LLMs to generate the entities and relationships, and LLM extraction is token-heavy. Model configurability was already in place, so the next issue was cost control. Once graph extraction started making LLM calls, I needed better visibility into token usage, pricing, and provider-level cost behavior. The point was not just to support different models, but to keep a handle on what those models were costing the engine.
I already have Llama running on another machine, and I expect local models to become part of the setup once my main hardware is upgraded. Cost control therefore had to work across hosted and local providers, even when the cost model is not the same.
Here is what this part covers:
LLM Graph Extraction: entities and relationships extracted from source chunks into validated internal models
Graph Query Planner and Traversal: bounded, transparent scoring; five explicit fallback conditions
Provider Registry: typed startup rejections; credentials validated by name, never by value
Capability Metadata: provider capabilities declared from config; unsupported requests fail at the boundary
Cost and Usage Tracking: cached token pricing; missing pricing handled gracefully
Eval Gates: non-zero exit on regression; explicit acceptance required for any deviation
Provider Replacement Smoke: dry-run path confirming no application code changes required
Domain Pack Deployment: pack enablement as a config and composition decision alone
LLM graph extraction
Goal: extract entities and relationships from source chunks into validated internal models, with every inferred record carrying full provenance before anything reaches Neo4j.
Dense and deterministic retrieval find what's explicitly present. They don't surface implicit connections between legal concepts (e.g. an amendment chain, a cross-reference between decrees) that aren't stated as exact identifiers. LLM extraction makes those latent relationships traversable.
I chose a LightRAG-style extraction interface instead of building a custom extraction pipeline from scratch. LightRAG extracts entities and relationships per chunk and stores them with provenance. That maps directly to the requirement without extra complexity.
The extraction runs on gpt-4.1-mini for now with a constrained budget: max 10 chunks per run, 4000 tokens, batch size 2 on the first run. LLM extraction is expensive relative to retrieval.
The critical constraint applied on top: raw LLM output never writes directly to the graph. Provider output gets validated into internal GraphEntity and GraphRelationship models first - a validation layer before any Neo4j writes. Every inferred record carries a full provenance chain: document ID, chunk ID, confidence bucket, extraction run ID, model, prompt hash, and source type.
Storage types are explicit:
Deterministic references → deterministic_reference
LLM-inferred data → llm_inferred_graph
Low-confidence inferred data → stored with retrieval_eligible: false
That last type matters. Low-confidence data exists for diagnostics, not for expanding retrieval. Without provenance on every record, there is no way to audit why a traversal path was taken or reject a path that shouldn't have run.
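A minimal sketch of that validation layer, under stated assumptions: the field names follow the provenance chain listed above, but the bucket names (`high`, `medium`, `low`), the `ValidationError` class, and the `validate_relationship` helper are hypothetical and not taken from the engine.

```python
from dataclasses import dataclass

# Provenance fields named in the post; everything else here is illustrative.
REQUIRED_PROVENANCE = (
    "document_id", "chunk_id", "confidence_bucket",
    "extraction_run_id", "model", "prompt_hash", "source_type",
)
ELIGIBLE_BUCKETS = {"high", "medium"}  # assumed bucket names


class ValidationError(ValueError):
    """Raised when provider output cannot become an internal model."""


@dataclass(frozen=True)
class GraphRelationship:
    source: str
    target: str
    relation: str
    provenance: dict
    retrieval_eligible: bool


def validate_relationship(raw: dict, provenance: dict) -> GraphRelationship:
    """Turn one raw provider record into a validated internal model."""
    missing = [k for k in REQUIRED_PROVENANCE if not provenance.get(k)]
    if missing:
        raise ValidationError(f"missing provenance: {missing}")
    for key in ("source", "target", "relation"):
        if not isinstance(raw.get(key), str) or not raw[key].strip():
            raise ValidationError(f"invalid field: {key}")
    # Low-confidence records are kept for diagnostics but never retrieved.
    eligible = provenance["confidence_bucket"] in ELIGIBLE_BUCKETS
    return GraphRelationship(
        source=raw["source"].strip(),
        target=raw["target"].strip(),
        relation=raw["relation"].strip(),
        provenance=dict(provenance),
        retrieval_eligible=eligible,
    )
```

Only objects returned by a function like this would be handed to the Neo4j writer. A record that fails validation never reaches the graph.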
Graph query planner and traversal
Goal: ensure traversal only runs when there's a reasonable expectation it will find something the baseline missed, and falls back cleanly when there isn't.
Running graph traversal on every query would be expensive and slow, with no guarantee of correctness. A simple factual question doesn't benefit from multi-hop relationship discovery. The planner decides first.
I defined five fallback conditions explicitly:
simple questions
ambiguous entity matches
missing graph entities
low-confidence matches
single-entity queries
Each condition represents a case where traversal either can't help or would introduce unreliable candidates.
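The five conditions above could be expressed as a planner that returns either a fallback reason or nothing. This is a sketch only: the `EntityMatch` shape, the order of the checks, and the 0.5 confidence floor are assumptions, not the engine's actual planner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Fallback(Enum):
    SIMPLE_QUESTION = "simple_question"
    AMBIGUOUS_ENTITY = "ambiguous_entity_match"
    MISSING_ENTITY = "missing_graph_entity"
    LOW_CONFIDENCE = "low_confidence_match"
    SINGLE_ENTITY = "single_entity_query"


@dataclass(frozen=True)
class EntityMatch:
    mention: str            # text span found in the query
    candidates: tuple       # graph entity IDs the mention resolved to
    confidence: float       # match confidence in [0, 1]


def plan_traversal(
    is_simple: bool,
    matches: Sequence[EntityMatch],
    confidence_floor: float = 0.5,  # assumed value, config-driven in practice
) -> Optional[Fallback]:
    """Return a fallback reason, or None when traversal should run."""
    if is_simple:
        return Fallback.SIMPLE_QUESTION
    if not matches or any(not m.candidates for m in matches):
        return Fallback.MISSING_ENTITY
    if any(len(m.candidates) > 1 for m in matches):
        return Fallback.AMBIGUOUS_ENTITY
    if any(m.confidence < confidence_floor for m in matches):
        return Fallback.LOW_CONFIDENCE
    if len(matches) < 2:
        return Fallback.SINGLE_ENTITY
    return None
```

Returning a named reason instead of a bare boolean keeps the decision visible in traces: a skipped traversal records why it was skipped.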
Traversal scoring uses a transparent formula:
path_score = mean_edge_confidence / path_length + relationship_bonus
I chose a transparent formula over a learned scoring model deliberately. A legal retrieval system cannot afford opaque path scoring. If a traversal path produces a wrong answer, I need to see why it scored the way it did.
Depth, path count, candidate count, per-seed limits, and confidence floor are all capped by config. The graph adds candidates to the retrieval set. It cannot flood it or bury exact lexical hits.
That is the same rule I used for deterministic reference expansion in Part 3: expansion is allowed, but it cannot take over retrieval.
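The scoring formula and the config caps can be sketched together. The formula is the one stated above. The `TraversalCaps` fields and their default values are placeholders for illustration, and only three of the listed caps are modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class TraversalCaps:
    # Hypothetical defaults; the engine reads its caps from config.
    max_depth: int = 2
    max_candidates: int = 5
    confidence_floor: float = 0.5


def path_score(edge_confidences: Sequence[float], relationship_bonus: float = 0.0) -> float:
    """path_score = mean_edge_confidence / path_length + relationship_bonus."""
    if not edge_confidences:
        raise ValueError("a path needs at least one edge")
    path_length = len(edge_confidences)
    mean_confidence = sum(edge_confidences) / path_length
    return mean_confidence / path_length + relationship_bonus


def rank_paths(paths: Dict[str, List[float]], caps: TraversalCaps) -> List[Tuple[str, float]]:
    """Score candidate chunks by path, dropping anything outside the caps."""
    scored = []
    for chunk_id, confidences in paths.items():
        if not confidences or len(confidences) > caps.max_depth:
            continue  # empty or too deep
        if min(confidences) < caps.confidence_floor:
            continue  # at least one edge is below the floor
        scored.append((chunk_id, path_score(confidences)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[: caps.max_candidates]
```

For a single edge with confidence 0.92 and no bonus, the score is 0.92 / 1 + 0 = 0.92. Longer paths are penalized twice, once through the mean and once through the division by length, which keeps multi-hop candidates from outranking direct ones on confidence alone.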
The traversal comparison was where the design paid off. A deterministic-only retrieval path didn't surface the expected target chunk in the test fixture. Deterministic plus LLM-inferred traversal found it at recall@1 = 1.0. A low-confidence unrelated edge stayed ineligible and didn't appear.
The graph found something the baseline missed. It did not lower the bar for what counts as evidence.
One point worth being explicit about: graph traversal is in the retrieval path, not the citation path. Every answer still cites source chunks. The graph expands what chunks are considered.
Graph traversal comparison:
{
  "deterministic_only": {
    "retrieved_expected": false,
    "recall_at_1": 0.0
  },
  "deterministic_plus_llm_graph": {
    "retrieved_expected": true,
    "recall_at_1": 1.0,
    "candidate_scores": [
      {
        "chunk_id": "graph-target",
        "edge_confidence": 0.92,
        "path_score": 0.92,
        "rank": 1,
        "accepted": true,
        "source_type": "llm_inferred_graph"
      }
    ]
  },
  "negative_control": {
    "chunk_id": "graph-unrelated",
    "relationship_confidence": 0.2,
    "retrieval_eligible": false,
    "retrieved": false
  }
}
Diagnostic feedback lets me mark paths as useful or harmful. Still, that feedback is observational — it cannot delete graph facts, mutate references, suppress source chunks, or change citation validation.
This continues the traceability rule from the earlier parts. Part 1 needed citation traces. Part 2 needed conversation and feedback traces. Here, graph traversal needed path traces, so every expanded candidate could be explained.
I also added Neo4j graph quality checks under system status: orphan entities, dangling edges, stale document versions, missing provenance, low-confidence relationship counts, deterministic-versus-inferred graph size, latest extraction run, and cleanup status. I did not treat graph traversal as ready just because it returned a better candidate. It also needed health signals and cleanup tests showing stale graph paths could be removed safely.
Provider registry
Goal: give every embedding and generation model a single typed registration shape derived from deployment config, so every assumption about provider capabilities and credentials is explicit and testable at startup.
The registry makes every assumption explicit: model capabilities, credential locations, endpoint formats.
The registry derives from deployment config. It is a typed view over existing config. Both embedding and generation models register with the same shape: provider, model, hosted/local mode, enabled state, credential env var name, endpoint, dimension, context limit.
Three typed startup rejections were specified:
a disabled active embedder
a missing default generation model
a local provider without an endpoint
These fail at startup, not as vague runtime errors when a query finally hits the missing component.
The credential boundary is enforced by design: credentials are validated by environment variable name only. The value is never read, printed, logged, or returned in any response. I considered a service-discovery pattern where providers register themselves at runtime. I rejected it. It adds complexity and makes startup validation harder, which is the opposite of what operational reliability requires.
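A sketch of the registration shape and the three startup rejections. The class and function names are hypothetical. The fields mirror the list above, and the credential check tests only whether the named variable is set, without reading its value.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


class RegistryError(Exception):
    """Base class for typed startup rejections."""


class DisabledActiveEmbedder(RegistryError):
    pass


class MissingDefaultGenerationModel(RegistryError):
    pass


class LocalProviderWithoutEndpoint(RegistryError):
    pass


@dataclass(frozen=True)
class ModelRegistration:
    provider: str
    model: str
    mode: str                       # "hosted" or "local"
    enabled: bool
    credential_env: Optional[str] = None   # variable NAME, never the value
    endpoint: Optional[str] = None
    dimension: Optional[int] = None
    context_limit: Optional[int] = None


def validate_registry(
    embedders: Mapping[str, ModelRegistration],
    generators: Mapping[str, ModelRegistration],
    active_embedder: str,
    default_generator: str,
) -> None:
    """Raise a typed error at startup instead of failing on the first query."""
    active = embedders.get(active_embedder)
    if active is None or not active.enabled:
        raise DisabledActiveEmbedder(active_embedder)
    if default_generator not in generators:
        raise MissingDefaultGenerationModel(default_generator)
    for name, reg in {**embedders, **generators}.items():
        if reg.mode == "local" and not reg.endpoint:
            raise LocalProviderWithoutEndpoint(name)


def credential_present(reg: ModelRegistration, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Membership test only: the credential value is never read or returned."""
    env = os.environ if environ is None else environ
    return reg.credential_env is not None and reg.credential_env in env
```

Calling `validate_registry` once during startup turns each of the three misconfigurations into a distinct exception type that tests can assert on.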
Capability metadata
Goal: let providers declare what they can do from config so unsupported requests fail explicitly at the boundary.
A provider that doesn't support structured output, streaming, or prompt caching will fail eventually. The question is whether that failure happens early and clearly, or late and cryptically. Capability metadata moves the failure to the earliest possible point.
Capabilities are declared from config, not discovered by calling the provider. Capability discovery calls add latency, cost, and a dependency on provider availability at startup. The tradeoff: config can fall out of sync with actual provider capabilities. That's why capability losses are explicitly surfaced in the provider replacement smoke - if a candidate provider doesn't support something the current one does, the smoke flags it before any change is committed.
Upper layers can inspect capabilities without making any provider calls. Unsupported requests return a typed error at the boundary, not a vague exception from somewhere downstream.
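As a rough illustration of config-declared capabilities with a typed boundary error: the three capability names come from the paragraph above, while `Capabilities`, `capabilities_from_config`, `require`, and `UnsupportedCapability` are hypothetical names.

```python
from dataclasses import dataclass, fields


class UnsupportedCapability(Exception):
    """Typed error raised at the boundary, before any provider call."""

    def __init__(self, provider: str, capability: str):
        super().__init__(f"{provider} does not support {capability}")
        self.provider = provider
        self.capability = capability


@dataclass(frozen=True)
class Capabilities:
    structured_output: bool = False
    streaming: bool = False
    prompt_caching: bool = False


def capabilities_from_config(cfg: dict) -> Capabilities:
    """Build capabilities from declared config; no provider call is made."""
    names = [f.name for f in fields(Capabilities)]
    return Capabilities(**{name: bool(cfg.get(name, False)) for name in names})


def require(provider: str, caps: Capabilities, capability: str) -> None:
    """Fail early if the declared capabilities do not include the request."""
    if not getattr(caps, capability, False):
        raise UnsupportedCapability(provider, capability)
```

An upper layer would call `require("some-provider", caps, "streaming")` before building the request, so an unsupported feature surfaces as `UnsupportedCapability` and not as a provider-side failure.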
Cost and usage tracking
Goal: record token counts and cost estimates on every model call - with missing pricing handled gracefully so cost visibility never blocks an answer.
Without cost visibility, a RAG system running in production generates a bill that is hard to attribute to specific queries or models. Understanding which queries are expensive, which models are cost-efficient, and where cached tokens are reducing costs requires structured usage records on every call.
Cached input tokens can be priced separately from fresh input tokens. The pricing config supports that shape:
openai/gpt-4.1:
  input_per_million: 2.00
  cached_input_per_million: 0.50
  output_per_million: 8.00
Provider-reported model ID is stored alongside token counts. The model I intended to call and the model the provider actually used can differ. Model aliases and version routing make this a real operational concern, not a theoretical one.
If pricing metadata is missing, the request succeeds and the usage record is marked unknown. Cost visibility is an operational improvement. It will not block an answer.
Historical cost records are never mutated when pricing config changes. The pricing version is stored with each record so historical estimates remain auditable.
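The cost rules in this section can be sketched as one function. The prices are the ones from the pricing config above. The `UsageRecord` fields, the `PRICING_VERSION` value, the choice to look up pricing by the requested model, and the assumption that `input_tokens` includes cached tokens are all illustrative.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping, Optional

PRICING_VERSION = "example-v1"  # hypothetical version label
PRICING = {
    "openai/gpt-4.1": {
        "input_per_million": Decimal("2.00"),
        "cached_input_per_million": Decimal("0.50"),
        "output_per_million": Decimal("8.00"),
    }
}
MILLION = Decimal(1_000_000)


@dataclass(frozen=True)
class UsageRecord:
    requested_model: str
    reported_model: str          # what the provider says it actually used
    input_tokens: int            # assumed to include cached tokens
    cached_input_tokens: int
    output_tokens: int
    cost_status: str             # "estimated" or "unknown"
    estimated_cost_usd: Optional[Decimal]
    pricing_version: Optional[str]


def record_usage(
    requested_model: str,
    reported_model: str,
    input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    pricing: Mapping[str, Mapping[str, Decimal]] = PRICING,
) -> UsageRecord:
    """Build a usage record; missing pricing yields 'unknown', never an error."""
    prices = pricing.get(requested_model)
    if prices is None:
        return UsageRecord(requested_model, reported_model, input_tokens,
                           cached_input_tokens, output_tokens, "unknown", None, None)
    fresh = input_tokens - cached_input_tokens
    cost = (
        fresh * prices["input_per_million"]
        + cached_input_tokens * prices["cached_input_per_million"]
        + output_tokens * prices["output_per_million"]
    ) / MILLION
    return UsageRecord(requested_model, reported_model, input_tokens,
                       cached_input_tokens, output_tokens, "estimated", cost, PRICING_VERSION)
```

With 1,500 input tokens of which 500 are cached, plus 200 output tokens, the estimate is (1000 × 2.00 + 500 × 0.50 + 200 × 8.00) / 1,000,000 = 0.00385 USD. The record is frozen and carries its pricing version, so a later pricing change produces new records without rewriting old ones.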
Eval gates
Goal: make any default model change require a before/after eval run, so degradation becomes visible and blocked unless explicitly accepted.
A generation model swap that improves latency or reduces cost while subtly degrading citation hit rate is undetectable without measurement. Eval gates make that degradation visible before it reaches the point where I'd have to discover it the hard way.
python -m evals.gates `
--config config.yaml `
--report evals/reports/CANDIDATE.json `
--baseline-report evals/reports/BASELINE.json `
--output evals/reports/CANDIDATE_GATE.json
The gate compares a candidate eval report against a baseline across retrieval metrics, citation hit rate, insufficient-evidence correctness, latency, and estimated cost. It exits non-zero on regression unless I explicitly accept it with a reference report.
That explicit acceptance requirement is deliberate. A regression that is accepted is a decision on record. Not an accident that went unnoticed.
The gate report carries the metadata needed to audit the run: default generation provider and model, active embedding provider with model and dimension, golden-query pack path, and a secret-redacted config hash.
Embedding changes are held to a stricter standard than generation changes. Changing embedding provider, model, or dimension is a full re-indexing event. The engine cannot point a new embedding model at an index built for a different dimension. The embedding re-indexing gate enforces this.
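The comparison logic behind a gate like this might look as follows. This is a simplified sketch, not the `evals.gates` module: the metric names are hypothetical, and explicit acceptance is modelled as a set of metric names where the real gate uses a reference report.

```python
from typing import Iterable, List, Mapping

# Hypothetical metric names, grouped by direction of improvement.
HIGHER_IS_BETTER = ("recall_at_1", "citation_hit_rate", "insufficient_evidence_accuracy")
LOWER_IS_BETTER = ("latency_ms", "estimated_cost_usd")


def find_regressions(baseline: Mapping[str, float], candidate: Mapping[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """List every metric where the candidate is worse than the baseline."""
    regressions = []
    for metric in HIGHER_IS_BETTER:
        if candidate[metric] < baseline[metric] - tolerance:
            regressions.append(metric)
    for metric in LOWER_IS_BETTER:
        if candidate[metric] > baseline[metric] + tolerance:
            regressions.append(metric)
    return regressions


def gate_exit_code(baseline: Mapping[str, float], candidate: Mapping[str, float],
                   accepted: Iterable[str] = ()) -> int:
    """Return 0 when there is no unaccepted regression, 1 otherwise."""
    accepted_set = set(accepted)
    unaccepted = [m for m in find_regressions(baseline, candidate) if m not in accepted_set]
    return 1 if unaccepted else 0


def embedding_requires_reindex(current: Mapping[str, object], candidate: Mapping[str, object]) -> bool:
    """Any change to provider, model, or dimension is a full re-indexing event."""
    return any(current[key] != candidate[key] for key in ("provider", "model", "dimension"))
```

A command-line wrapper would pass the result of `gate_exit_code` to `sys.exit`, which is what makes a regression block a CI step or a deployment script.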
Provider replacement smoke
Goal: prove that a provider swap is expressible through config and operations workflow alone — with visible results before any change is committed.
The dry-run path checks current versus candidate provider identity, hosted/local mode, credential env var, capability losses, pricing metadata gaps, rollback option, and embedding re-indexing consequences.
Both a hosted and a local generation swap confirmed the same result:
status: ok
application_code_change_required: false
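A sketch of what a dry-run comparison could report, covering a subset of the checks listed above. The `dry_run_swap` function, the config dictionary keys, and the report keys are hypothetical and do not reproduce the engine's smoke output.

```python
from typing import Collection, Mapping


def provider_id(cfg: Mapping[str, object]) -> str:
    """Identity used for both pricing lookup and rollback reference."""
    return f"{cfg['provider']}/{cfg['model']}"


def dry_run_swap(current: Mapping[str, object], candidate: Mapping[str, object],
                 priced_models: Collection[str]) -> dict:
    """Compare two provider configs and report consequences without applying them."""
    current_caps = set(current.get("capabilities", ()))
    candidate_caps = set(candidate.get("capabilities", ()))
    return {
        "current": provider_id(current),
        "candidate": provider_id(candidate),
        "mode_change": current["mode"] != candidate["mode"],
        # Only the variable name is reported, never its value.
        "credential_env": candidate.get("credential_env"),
        "capability_losses": sorted(current_caps - candidate_caps),
        "pricing_metadata_missing": provider_id(candidate) not in priced_models,
        "rollback_to": provider_id(current),
    }
```

Nothing in the function mutates config or calls a provider, which is what makes it safe to run before a change is committed.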
Domain pack deployment
Goal: prove enabling a new domain pack locally is a config and composition decision, with no engine source code changes required.
If adding a second domain requires touching engine code, the domain-neutral architecture from Part 3 has failed.
A pack is enabled through config: path, module, factory, manifest, fixture path, domain, version, reference types, languages.
The composed smoke loaded the configured packs through the generic loader:
pack_count: 2
domains: legal_regulatory, company_x_kb
Each pack proved its own extraction behavior independently. The engine proved it could load and report on both generically.
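A generic loader along these lines resolves each pack from config by module and factory name, so the engine never imports a pack directly. The sketch uses a subset of the config fields. `PackConfig`, `load_packs`, the `build_pack` factory name, and the in-memory demo modules are hypothetical stand-ins for real pack packages.

```python
import importlib
import sys
import types
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PackConfig:
    module: str     # importable module name taken from config
    factory: str    # name of a zero-argument factory in that module
    domain: str
    version: str


def load_packs(configs: Iterable[PackConfig]) -> dict:
    """Load every configured pack through its factory and report on the set."""
    packs = []
    for cfg in configs:
        module = importlib.import_module(cfg.module)
        pack = getattr(module, cfg.factory)()
        if pack["domain"] != cfg.domain:
            raise ValueError(f"{cfg.module} declares domain {pack['domain']!r}, expected {cfg.domain!r}")
        packs.append(pack)
    return {
        "pack_count": len(packs),
        "domains": [p["domain"] for p in packs],
        "versions": [p["version"] for p in packs],
    }


def register_demo_pack(module_name: str, domain: str, version: str) -> None:
    """Demo only: install an in-memory module so the sketch runs standalone."""
    module = types.ModuleType(module_name)
    module.build_pack = lambda: {"domain": domain, "version": version}
    sys.modules[module_name] = module
```

Adding a domain then means adding one more `PackConfig` entry. The loader itself does not change.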
Legal/regulatory remains the first production pack. Other domain packs remain interface-proof packs until product scope requires them.
Frozen domain-pack composition evidence:
Legal pack tests: 1 passed
Company X KB pack tests: 1 passed
Engine contract/config/storage/eval tests: 32 passed
Configured sibling import smoke:
{'pack_count': 2, 'domains': ['legal_regulatory', 'company_x_kb'], 'versions': ['legal_regulatory.deterministic.v1', 'company_x_kb.deterministic.v1']}
Pack lint checks: All checks passed!
Rollout rules are explicit. Two categories of change require different responses:
Re-extraction and eval required:
Pack rule changes
Normalizer changes
Manifest version changes
Reference type changes
Language changes
Graph rebuild may be required:
Target-resolution changes
Traversal policy changes
The distinction between re-extraction events and graph rebuild events matters. Conflating them either over-triggers expensive rebuilds or under-triggers necessary ones.
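The two rollout categories can be kept apart with a small lookup. The change-type identifiers and action names below are hypothetical labels for the items listed above.

```python
from typing import Iterable, List

RE_EXTRACTION_AND_EVAL = frozenset({
    "pack_rules", "normalizer", "manifest_version", "reference_types", "languages",
})
GRAPH_REBUILD_POSSIBLE = frozenset({"target_resolution", "traversal_policy"})


def required_actions(changed: Iterable[str]) -> List[str]:
    """Map a set of change types to the rollout actions they trigger."""
    changed_set = set(changed)
    unknown = changed_set - RE_EXTRACTION_AND_EVAL - GRAPH_REBUILD_POSSIBLE
    if unknown:
        # An unclassified change should be reviewed, not silently ignored.
        raise ValueError(f"unclassified change types: {sorted(unknown)}")
    actions = []
    if changed_set & RE_EXTRACTION_AND_EVAL:
        actions += ["re_extraction", "eval"]
    if changed_set & GRAPH_REBUILD_POSSIBLE:
        actions.append("review_graph_rebuild")
    return actions
```

A normalizer change alone returns only the re-extraction and eval actions, so it does not trigger a graph rebuild review it does not need.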
This part added engine controls for graph retrieval, model changes, cost tracking, eval gates, and domain-pack deployment.
Up next: finishing the first domain pack and ingesting a real corpus into it.