Why your RAG demo works and your RAG product doesn't

A teardown of the five places retrieval quietly breaks between demo and production — chunking, stale indexes, near-duplicate context, recall vs. answer quality, and 'fine on 10 questions' — each with the fix.

The demo answers ten questions about twelve clean PDFs with impressive precision. Nobody in the room asks what happens on question eleven, or what happens when the PDFs change.

Production answers those questions. And ten thousand others.

Five failure modes cause most of what goes wrong between demo and production. Each one is invisible on a handful of curated inputs and completely predictable at scale.

Failure mode	Symptom in production	Fix
Naive chunking	Answers miss context that's "right there" in the source	Structure-aware chunking with overlap
Stale index	Correct on last month's data, wrong on today's	Incremental re-embedding pipeline
Near-duplicate context	Same answer, five slightly different phrasings, no real substance	MMR / diversity-aware retrieval
High recall, wrong answer	Right doc retrieved; answer still wrong	Cross-encoder reranking + end-to-end evals
No eval set	You don't know when it regresses	Real eval set, Ragas metrics, CI gate

Chunking: fixed-size breaks meaning

The demo uses documents that happen to fit cleanly into 512-token chunks. That's almost never true in a real corpus.

A fixed-size chunker doesn't know about paragraphs, headers, or sentences. It counts tokens and cuts. The result is chunks that start mid-sentence and end mid-clause — semantically blurry in a way that's invisible to you and very visible to the embedder. A chunk about quarterly revenue that opens with "...as compared to Q3 projections, the" embeds differently from the full sentence. The retriever has to find signal in noise.

Demo documents tend to be short, well-structured, and single-topic per page. Those conditions hide chunking problems entirely. Feed in a 200-page technical manual or a corpus of support tickets and the seams show up everywhere.

The fix: let document structure determine boundaries. Split on headings first, then paragraph breaks, then sentence ends — not token counts. Add overlap (a paragraph from the previous chunk, typically 10–15% of chunk size) so continuity doesn't die at the cut. For code, chunk at function and class boundaries. For long documents where both coarse and fine retrieval matter, consider parent-child chunking: embed document-level summaries separately from fine-grained passages, use the summary layer to locate the right section, then retrieve the passage. LlamaIndex calls this pattern by that name if you want something to search.
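The boundary rules above can be sketched in a few lines. This is a minimal illustration, not a production chunker: it assumes headings are Markdown-style lines starting with `#`, measures size in characters rather than tokens, and stops at paragraph granularity (no sentence-level fallback). The function name and the `max_chars` and `overlap_paragraphs` parameters are made up for the example.

```python
import re


def split_structured(text, max_chars=1200, overlap_paragraphs=1):
    """Split on headings first, then pack whole paragraphs into chunks.

    Assumption: headings are Markdown-style lines ("# Title").
    Size is counted in characters as a stand-in for tokens.
    """
    # Zero-width split: each section starts at a heading line.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section) if p.strip()]
        current = []
        for para in paragraphs:
            size = sum(len(p) for p in current) + len(para)
            if current and size > max_chars:
                chunks.append("\n\n".join(current))
                # Carry the trailing paragraph(s) forward as overlap.
                current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current.append(para)
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Chunks never cross a heading, and a paragraph is never cut in half. A paragraph longer than `max_chars` still becomes one oversized chunk here, which is where a sentence-level split would take over.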

Stale indexes: point-in-time confidence

The demo embeds once and the documents don't change. Production is neither.

Knowledge bases drift. This month's pricing table replaces last month's. A policy changes and a dozen pages need re-embedding. A new release makes half the API documentation wrong. A vector index is a point-in-time snapshot: when the source changes and the index doesn't, retrieval confidently returns outdated information. The similarity score looks fine. The chunk looks authoritative. The user gets last month's answer.

There's no error signal. Stale retrieval fails silently.

Running the indexing script once isn't production infrastructure. Production needs a refresh pipeline: a mechanism to detect document changes and re-embed the affected chunks. For frequently-changing documents — pricing tables, API references, policy pages — trigger re-embedding on change via webhook from your document store or CMS. For stable archives, a weekly scheduled re-embed keeps drift manageable. Either way, store a last_embedded_at timestamp as chunk metadata. Both pgvector and Qdrant support metadata filtering; use it to deprioritize stale chunks when freshness matters to the query.

One detail that trips teams up: deletion is harder than insertion. When a document updates, delete the old chunks before indexing the new version. Skip that step and both versions sit in the index, and the retriever picks whichever scores higher, often the stale one.
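A rough sketch of that refresh step, under stated assumptions: `InMemoryIndex` is a hypothetical stand-in for a real vector store (pgvector and Qdrant each have their own delete and upsert calls), change detection uses a content hash, and `chunker` and `embed` are whatever functions your pipeline already has. Only the order of operations is the point: detect the change, delete the old chunks, then insert the new ones with a `last_embedded_at` stamp.

```python
import hashlib
from datetime import datetime, timezone


class InMemoryIndex:
    """Hypothetical stand-in for a vector store, for illustration only."""

    def __init__(self):
        self.chunks = {}      # chunk_id -> record
        self.doc_hashes = {}  # doc_id -> content hash at last embed

    def delete_document(self, doc_id):
        self.chunks = {
            cid: rec for cid, rec in self.chunks.items() if rec["doc_id"] != doc_id
        }

    def add_chunk(self, chunk_id, record):
        self.chunks[chunk_id] = record


def refresh_document(index, doc_id, text, chunker, embed):
    """Re-embed a document only if its content changed. Returns True if it did."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if index.doc_hashes.get(doc_id) == digest:
        return False  # unchanged, nothing to re-embed

    # Delete first, so the old and new versions never coexist in the index.
    index.delete_document(doc_id)
    now = datetime.now(timezone.utc).isoformat()
    for i, chunk in enumerate(chunker(text)):
        index.add_chunk(
            f"{doc_id}:{i}",
            {
                "doc_id": doc_id,
                "text": chunk,
                "vector": embed(chunk),
                "last_embedded_at": now,
            },
        )
    index.doc_hashes[doc_id] = digest
    return True
```

The same function works behind a webhook handler or a weekly cron job; the hash check makes the scheduled run cheap for documents that didn't change.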

Near-duplicate context: when top-k retrieves one thing five times

A small, diverse corpus makes top-5 retrieval naturally varied. Fifty thousand documents on a shared domain do not.

At scale, any frequently-covered topic has dozens of documents covering roughly the same ground. A product FAQ, three blog posts, a support article, and a press release all answer "how does the refund policy work?" with slightly different phrasing. Similarity search finds the most relevant chunk, then the next-most-relevant — which is nearly identical. The context window fills with one piece of evidence in five costumes. The model has no breadth. The chunk that held the edge case or the exception never made it into the top-k.

Maximal Marginal Relevance (MMR) addresses this directly. MMR selects each next result by trading off relevance against similarity to already-selected results. A λ of 0.5 balances the two; push toward 1.0 for more relevance, toward 0.0 for more coverage. LangChain, LlamaIndex, and the Qdrant client all expose MMR as a retrieval mode alongside standard similarity search. Switching it on is usually a one-line change.
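If you'd rather see the mechanics than flip a library flag, here is a small pure-Python sketch of the greedy MMR selection described above. It assumes embeddings are plain lists of floats and uses cosine similarity for both relevance and redundancy; the function names are illustrative and not taken from any of the libraries mentioned.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=5, lam=0.5):
    """Greedy MMR: return indices of up to k documents.

    lam=1.0 is pure relevance; lam=0.0 is pure diversity.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-identical chunks and one that is less relevant but different, `lam=0.5` picks the top hit and then skips its twin in favor of the different chunk; `lam=1.0` falls back to plain similarity order.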

Recall vs. answer quality: two different measurements

One failure catches teams off guard late in the process: the right document was retrieved, and the answer is still wrong.

Retrieval recall measures whether the relevant chunk appeared somewhere in your top-k results. That's necessary. It's not sufficient. Generation is a separate step with separate failure modes. The pipeline can retrieve the right chunk, and the model can still misinterpret it, hallucinate a detail that feels consistent with the context, or ignore it in favor of a more confidently worded but incorrect chunk ranked higher. High recall means the evidence was present. It says nothing about whether the model used it correctly.

This is the reason the 2026 open-source AI stack breaks retrieval into a three-step sequence — "chunk well, search hybrid, then rerank" — rather than one. Hybrid search (vector similarity combined with BM25 keyword matching, with Reciprocal Rank Fusion to merge the two ranked lists) improves recall on queries where lexical matching outperforms embeddings: exact product names, version strings, specific identifiers that dense vectors tend to blur. Reranking with a cross-encoder then re-scores the shortlist with a model that reads query and document together, producing a much more accurate ordering than cosine distance alone. The sentence-transformers CrossEncoder class is the open-source path; Cohere Rerank is the commonly-used API option.
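Reciprocal Rank Fusion is simple enough to show whole. The sketch below assumes each retriever returns an ordered list of document IDs, best first; `k=60` is the constant conventionally used with RRF, and the function name is just for this example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Example: one list from vector search, one from BM25.
vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses ranks only, never raw scores, which is why it can merge cosine similarities and BM25 scores without any normalization. Documents that appear in both lists rise to the top; the fused shortlist is what you would hand to the cross-encoder.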

But reranking is still a retrieval fix. Measuring end-to-end answer quality means measuring the full pipeline. Ragas provides the RAG-specific metrics: faithfulness (are the answer's claims actually supported by the retrieved context?) and answer relevance (does the answer address what was asked?). A system with strong retrieval recall and weak faithfulness has a generation problem. The reverse has a retrieval problem. One blended "accuracy" number tells you something is wrong but not which of the two it is.
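To make the split concrete, here is a hedged sketch of keeping the two measurements apart. `recall_at_k` is computed from labeled relevant chunk IDs; the faithfulness score is assumed to come from whatever scorer you use (Ragas or otherwise) and is passed in as a plain number. The 0.8 threshold and the rule of checking retrieval first are illustrative choices, not recommendations from any library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def diagnose(recall, faithfulness, threshold=0.8):
    """Point at the stage to look at first. The threshold is a hypothetical cutoff."""
    if recall < threshold:
        return "retrieval"   # the evidence never reached the model
    if faithfulness < threshold:
        return "generation"  # the evidence was there and was misused
    return "ok"
```

Retrieval is checked first because a faithfulness score means little when the right evidence never made it into the context.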

"Fine on 10 questions": the demo is the test set

The deepest problem isn't any of the four above. It's that without a real eval set, you won't know when any of them regress.

When the demo serves as validation, you've tested only the questions you used to build it: by definition, the inputs the system handles well. Any chunking change, embedding model swap, or prompt edit afterward gets tested on those same ten questions. You're updating a system you can't measure.

A change that helps on ten demo questions might hurt on the actual distribution of production queries. A chunk-size tweak that improves retrieval on your FAQ corpus might degrade it on long-form documents. Without a real eval set, you find out from user complaints.

The field guide to AI coding agents frames this as the core question for any AI system: "can it tell, on its own, whether it succeeded?" A RAG system with no eval harness can't answer that. A prompt edit, a new document type entering the corpus, a retrieval configuration change — any of these can silently degrade quality. You need a ground truth to move toward.
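One way to give the system that ability is a regression gate in CI. The sketch below assumes you already produce a dictionary of metric scores per run and keep a stored baseline; the metric names, the numbers, and the 0.02 tolerance are all placeholders.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped more than `tolerance` below baseline.

    A metric missing from `current` counts as a regression.
    """
    regressions = {}
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None or value < base - tolerance:
            regressions[metric] = (base, value)
    return regressions


# Hypothetical scores, for illustration only.
baseline = {"faithfulness": 0.90, "answer_relevance": 0.85, "recall_at_5": 0.92}
current = {"faithfulness": 0.84, "answer_relevance": 0.86, "recall_at_5": 0.91}
failed = find_regressions(baseline, current)
```

In a CI job, a non-empty `failed` would end the run with a non-zero exit code, so a prompt edit that quietly costs faithfulness blocks the merge instead of reaching users.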

The cache

A few things worth keeping, the way we keep everything here — techniques, not endorsements.

Fix chunking before optimizing anything else. Every downstream component — the embedder, the retriever, the LLM — sits on top of chunk quality. A bad chunk is a bad retrieval unit, and no downstream step fully recovers from it. Tune the chunk boundaries before you swap in a better embedding model.
An eval set is not optional once you have real users. The demo is not a test suite. Build 100+ question-answer pairs, cover the adversarial cases, run the suite in CI on every change. Ragas handles the RAG-specific scoring; plug it alongside your regular test suite and treat a score drop as a blocking failure.
Measure retrieval recall and answer quality separately. Missing evidence, unfaithful answers, and off-topic answers are different failure modes with different fixes. One blended number tells you something is wrong but not where. The 2026 open-source AI stack covers the supporting tooling: pgvector or Qdrant for storage, Postgres full-text search for the BM25 side if you're already there, Langfuse for tracing the full pipeline.
Hybrid search and reranking are the baseline. Pure vector search doesn't hold up against real query distributions — exact names, identifiers, quoted phrases need lexical matching. A cross-encoder reranker is cheap to add on top of any existing retrieval pipeline and materially improves the final ranking.

None of this is complicated. It's the work the demo made look optional.