RAG Evaluation Interface with Chainlit

Part 1 ended with a pipeline operated entirely from the command line. I could ingest documents, run a query, and get an answer tied to its evidence. What I didn't have was a UI.

I needed an interface to test what I'd built the way I'd actually use it. Ask something. Read the answer. Check where it came from. Decide if it held up.

Here is what this part covers:

Chainlit as the UI Layer: a chat interface wired to the FastAPI query and feedback endpoints
Session Controls: per-session configuration scoped to the session and never persisted
Citation Display: raw chunk IDs and backend references translated into human-readable source entries
Provider Error Handling: provider failures returned as plain-language recovery messages
Operator Status: system state surfaced from stored records without touching retrieval or generation
Conversation History: two rounds of persistence design, ending at native thread integration

The UI layer with Chainlit

I needed an interface I could actually use to evaluate the pipeline, and I didn't want to build a frontend from scratch to get there.

I chose Chainlit over a custom React frontend because the integration surface is minimal. Message handlers map directly to API calls. It ships with thread persistence, feedback controls, and a sidebar natively. Those are hours of frontend work I didn't have to do.

The tradeoff is limited control over the visible layer. A custom frontend would have given me full control earlier. I accepted the tradeoff and paid for it in the product pass later.

Wiring it to FastAPI did not take long. Messages go to POST /query. Feedback goes to POST /feedback. Follow-up context is bounded. Conversation text can help reframe a question, but it cannot become retrieved evidence. The retrieval corpus and chat context stay separate, so a citation always traces back to something actually in the corpus, not to something said earlier in the conversation.
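The separation between chat context and retrieved evidence can be sketched as a small payload builder. This is a minimal sketch, not the project's code: the field names (`question`, `conversation_context`), the `MAX_CONTEXT_TURNS` limit, and the function name are all assumptions. In a Chainlit app, something like this would run inside the message handler before the request goes to POST /query.

```python
from typing import Dict, List

# Hypothetical limit: how many prior turns may accompany a follow-up question.
MAX_CONTEXT_TURNS = 3


def build_query_payload(question: str, history: List[Dict[str, str]]) -> Dict[str, object]:
    """Build the body for POST /query (field names are assumptions).

    `history` is a list of {"role": ..., "content": ...} turns. Only the most
    recent turns are forwarded, and they travel in their own field so the
    backend can use them to reframe the question without treating them as
    retrievable evidence.
    """
    recent = history[-MAX_CONTEXT_TURNS:] if MAX_CONTEXT_TURNS > 0 else []
    return {
        "question": question.strip(),
        "conversation_context": [dict(turn) for turn in recent],
    }
```

Keeping the context in a separate field is what makes the rule checkable: nothing in `conversation_context` is ever a candidate for citation.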

UI/UX issues

The first few uses exposed issues, starting with ambiguity around "not enough evidence". Sometimes the corpus genuinely did not have what I was asking for. Other times something upstream had failed and was reported the same way. Those are different problems: one is a gap in the data, the other is a bug. Collapsing them into one message meant every occurrence sent me back to investigate from scratch.
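One way to keep those two cases apart is to give each its own status in the response, so the UI never has to guess. A minimal sketch of that idea; the status names and the `classify_outcome` helper are hypothetical, not taken from the project:

```python
from enum import Enum
from typing import Optional


class AnswerStatus(str, Enum):
    ANSWERED = "answered"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"  # corpus gap
    UPSTREAM_FAILURE = "upstream_failure"            # bug or outage


def classify_outcome(evidence_count: int, error: Optional[Exception]) -> AnswerStatus:
    """Map a pipeline result to a status the UI can render distinctly."""
    if error is not None:
        # Something broke before or during retrieval; this is not a data gap.
        return AnswerStatus.UPSTREAM_FAILURE
    if evidence_count == 0:
        # Retrieval ran cleanly and found nothing usable.
        return AnswerStatus.INSUFFICIENT_EVIDENCE
    return AnswerStatus.ANSWERED
```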

For context, the first corpus I built was a small set of RAG documentation and papers I found online. It was intentionally limited for testing.

Examples of issues found:

Citations showed raw chunk IDs: chunk-abc-p3-c1 is an internal identifier. I'd read an answer and have no quick way to confirm whether the source actually backed it up.

Source links exposed backend JSON: clicking a reference opened the raw API response, a payload I'd have to parse to find the actual source.

Provider outages leaked stack traces: an unhandled exception from OpenAI rendered straight into the chat window.

Latency was noticeable: around four seconds per answer. Not bad for a local setup, but not invisible either.

"What is RAG?" was rejected as too ambiguous: the corpus had relevant material. The query planner flagged it as too broad for the evidence validator. Technically defensible. Still frustrating to sit through as the person asking.

"New chat" felt destructive: conversations persisted in PostgreSQL, but the interface gave me no way back to a previous one. Every restart meant starting cold.

Default Chainlit branding: the app looked like a Chainlit demo.

French labels exposed encoding problems: accented characters in source titles broke.

One more change: reasoning traces and tool output had to stay out of the normal chat path. They are useful for debugging, but they don't belong in front of a user trying to read an answer.

These were not retrieval failures, which was a good thing. The citations were correct, and the pipeline returned relevant material. The problems were in the interface: visibility, error handling, source review, session continuity, and polish.

Session controls

The aim was to let internal users configure their own provider setup (model, API key, local endpoint, answer language) without persisting anything sensitive.

Different users run different providers. Hardcoding a model or API key into the deployment makes the tool unusable for anyone not on that setup. Session-scoped configuration keeps the system flexible.

API key handling is a security constraint, not a UX preference. Keys are masked at input and exist only for the session duration. They are never written to storage. Not to PostgreSQL, not to logs, not to traces. The key is cleared when the session ends.
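The masking and clearing behavior can be illustrated with a small in-memory holder. This is a sketch under assumptions: `SessionSecrets` and `mask_key` are hypothetical names, and the object is assumed to live only in per-session memory (for example, Chainlit's `cl.user_session`) and never be serialized.

```python
from dataclasses import dataclass, field
from typing import Optional


def mask_key(key: str) -> str:
    """Show only the last four characters, e.g. for a settings panel."""
    if len(key) <= 4:
        return "*" * len(key)
    return "*" * (len(key) - 4) + key[-4:]


@dataclass
class SessionSecrets:
    # repr=False keeps the key out of repr(), so logging the object
    # cannot leak it by accident.
    api_key: Optional[str] = field(default=None, repr=False)

    def masked(self) -> str:
        return mask_key(self.api_key) if self.api_key else ""

    def clear(self) -> None:
        """Call when the session ends; nothing was persisted, so this is all."""
        self.api_key = None
```

Excluding the field from `repr` is a small guard, but it covers the most common accidental leak: an object dumped into a log line or trace.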

I implemented usage details (model used, token counts, estimated cost) as an opt-in view. They come from stored records. Viewing them never triggers a model call.

I wired in both Gemini and OpenAI API keys so I could switch between them and compare answer generation. I haven't run that comparison yet. The capability is there; the evidence for which one works better on this corpus is not.

Citation display

This step translated internal retrieval identifiers into numbered, human-readable source entries before the user ever sees them.

A citation is only useful if it can be verified. chunk-abc-p3-c1 does not help anyone. The citation display layer translates each chunk reference into a document name, section, page, and readable preview: enough to understand where the answer came from and find the source.

Source links open a reference detail panel, not the raw API JSON. The panel shows what is needed to verify a source: the document, the relevant excerpt, and the citation context.
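The translation step might look roughly like the following. It is a sketch, not the project's implementation: `SourceMeta`, `format_citations`, and the entry format are assumptions, and the metadata is looked up by chunk ID rather than parsed out of the identifier, since the ID format is internal.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourceMeta:
    document: str
    section: str
    page: int
    preview: str


def format_citations(chunk_ids: List[str], metadata: Dict[str, SourceMeta]) -> List[str]:
    """Turn chunk IDs into numbered, readable source entries.

    Duplicate chunk IDs keep their first number; IDs with no metadata are
    labeled as unavailable rather than shown raw.
    """
    entries: List[str] = []
    seen = set()
    for chunk_id in chunk_ids:
        if chunk_id in seen:
            continue
        seen.add(chunk_id)
        number = len(entries) + 1
        meta = metadata.get(chunk_id)
        if meta is None:
            entries.append(f"[{number}] Source details unavailable")
            continue
        entries.append(
            f"[{number}] {meta.document}, {meta.section}, p. {meta.page}: "
            f'"{meta.preview}"'
        )
    return entries
```

The raw identifier never reaches the output, even when the lookup fails.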

Provider error handling

The next step was to make provider failures UI-safe: catch the exception at the API boundary, return a structured error, and render a plain-language recovery message without exposing internal details.

An OpenAI outage mid-session should produce a message like "The generation provider is currently unavailable. You can try switching to a local model in session settings." Not a Python exception.

The recovery message gives enough context to act on: switch models, adjust settings, or try again.

Stack traces, provider names, and HTTP status codes are still logged internally.
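The catch-at-the-boundary pattern can be sketched with the standard library alone. The names here are hypothetical: `ProviderError` stands in for whatever exception the provider client raises, and `generate_safely`, the logger name, and the `provider_unavailable` code are illustrative choices rather than the project's actual ones.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("rag.providers")

RECOVERY_MESSAGE = (
    "The generation provider is currently unavailable. "
    "You can try switching to a local model in session settings."
)


class ProviderError(Exception):
    """Hypothetical wrapper for failures raised by a provider client."""

    def __init__(self, provider: str, status_code: int, detail: str):
        super().__init__(detail)
        self.provider = provider
        self.status_code = status_code


def generate_safely(generate: Callable[[], str]) -> Dict[str, object]:
    """Run a generation call and return a UI-safe result either way."""
    try:
        return {"ok": True, "answer": generate()}
    except ProviderError as exc:
        # Internal details go to the log, with the traceback attached.
        logger.exception(
            "provider failure: provider=%s status=%s", exc.provider, exc.status_code
        )
        # The user sees only a structured, plain-language error.
        return {
            "ok": False,
            "error": {"code": "provider_unavailable", "message": RECOVERY_MESSAGE},
        }
```

The UI then renders `error.message` as an ordinary chat message and never touches the exception itself.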

Operator status

I needed a way to check index state, model availability, and evaluation state without triggering retrieval or generation.

Debugging a deployment by accidentally running a real query against it defeats the point of checking status in the first place.

The status command is a read-only operational path. It summarizes corpus inventory, index state, configured models, and eval report state without triggering retrieval, generation, embedding, or billed model calls.

This constraint is enforced by design. There is no code path that allows the status command to call an embedding provider or LLM.

Status reporting also stays separate from the chat surface. Asking a question, inspecting a cited source, or giving feedback uses one path. Checking deployment state uses another. Keeping them apart means a status check can never accidentally show up as a real query in the system's own usage history.
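One way to get "no code path to a provider" is to make the status function accept only stored data. The sketch below assumes that shape; `build_status` and the record fields (`chunk_count`, `state`, `created_at`) are hypothetical, not the project's schema.

```python
from typing import Dict, List


def build_status(
    documents: List[Dict[str, object]],
    index_record: Dict[str, object],
    configured_models: List[str],
    eval_reports: List[Dict[str, object]],
) -> Dict[str, object]:
    """Summarize deployment state from already-stored records.

    The function receives plain data only. It is never handed an embedding
    or LLM client, so it has nothing it could call.
    """
    # ISO-8601 timestamps sort correctly as strings.
    latest_eval = max(eval_reports, key=lambda r: str(r["created_at"]), default=None)
    return {
        "corpus": {
            "documents": len(documents),
            "chunks": sum(int(d.get("chunk_count", 0)) for d in documents),
        },
        "index": {"state": index_record.get("state", "unknown")},
        "models": sorted(configured_models),
        "eval": {"latest_report": latest_eval["created_at"] if latest_eval else None},
    }
```

Because the inputs are plain lists and dicts, the read-only property is visible in the signature rather than resting on convention.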

Conversation history, two rounds

The aim: make prior conversations accessible across sessions without letting conversation context become retrieved evidence.

Round 1

The first version had a real problem: I had no way to see past conversations. They were saved on the backend, but they weren't tied to a session, so the frontend had nothing to retrieve them by. Every time I refreshed or restarted, I was starting from zero, even though the data was sitting right there in PostgreSQL.

I tried a custom dropdown selector on top of that PostgreSQL store. It worked technically. It pulled the right records. It still felt disconnected, because a dropdown isn't how chat history behaves anywhere else, and starting a new chat still felt like losing the thread of prior work entirely.

Round 2

The second pass integrated Chainlit's native thread persistence while keeping PostgreSQL authoritative:

Chainlit thread IDs map to API conversation IDs
on_chat_resume rebuilds bounded context from stored message steps
History appears in the native left sidebar, the same pattern I'm used to from any chat interface
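The context rebuild on resume can be sketched as a pure function over the stored steps. This is illustrative only: the step shape (a `type` of `user_message` or `assistant_message` plus an `output` field) and the `MAX_RESUMED_MESSAGES` bound are assumptions. In a Chainlit app, the `on_chat_resume` handler would pass the resumed thread's steps to a function like this.

```python
from typing import Dict, List

# Hypothetical bound on how many prior messages are restored as context.
MAX_RESUMED_MESSAGES = 6

# Step types treated as conversation turns (assumed to match the stored steps).
ROLE_BY_STEP_TYPE = {"user_message": "user", "assistant_message": "assistant"}


def rebuild_context(steps: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rebuild bounded chat context from stored thread steps.

    Tool calls and other non-message steps are skipped, and only the most
    recent messages are kept.
    """
    turns = [
        {"role": ROLE_BY_STEP_TYPE[step["type"]], "content": step.get("output", "")}
        for step in steps
        if step.get("type") in ROLE_BY_STEP_TYPE
    ]
    return turns[-MAX_RESUMED_MESSAGES:]
```

The result feeds follow-up reframing only; it is never written into the retrieval corpus.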

One operational catch: persisted native threads require an authenticated Chainlit user. Anonymous sessions could still chat, but native thread endpoints returned unauthorized without Chainlit auth enabled.

The constraint preserved across both designs: conversation history can help reframe a follow-up question, but it is not retrieved evidence. The chat context never enters the retrieval corpus.

Thread persistence does not change key handling. Session API keys remain available only during the active session and are not written to storage.

This part ended with a usable evaluation interface: conversation history, readable citations, feedback capture, session controls, and operator checks.

Up next: adding graph retrieval, then running live retrieval tests across the full system.

Building a RAG Engine — Part 2: An Interface to Evaluate It