RAG demos are deceptively easy. You split some documents, embed them, store them in a vector DB, and retrieve the top-k chunks at query time. It works great on your laptop with 50 documents and a handful of test queries. Then you ship it, and reality introduces itself.

I’ve built and operated RAG pipelines across a few different products now, and every single time I’ve been surprised by something that didn’t show up in the prototype. Here’s what I wish someone had told me before I started.

Why Production RAG Is a Different Animal

In a demo, your documents are clean, your queries are predictable, and you test the happy path. In production, users ask questions in unexpected ways, your document corpus grows and gets messy, and the subtle bugs are the kind that don’t throw exceptions — they just quietly return bad answers.

The failure mode isn’t a 500 error. It’s a confident, fluent, plausible-sounding answer that’s wrong. That’s much harder to catch.

Chunking Strategy Matters More Than You Think

The first decision in any RAG pipeline is how to split your documents, and it has an outsized impact on retrieval quality. I’ve tried all the common approaches.

Fixed-size chunking (e.g., 512 tokens with a 64-token overlap) is simple and works tolerably well for dense, uniform text. The problem is it’s completely ignorant of document structure. You’ll regularly split a paragraph mid-sentence, or worse, split a table header from its rows.
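To make the mechanics concrete, here’s a minimal sketch of fixed-size chunking with overlap. The function name is mine, and whitespace splitting stands in for a real tokenizer, so treat the sizes as illustrative rather than as true token counts.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting is an assumption; swap in your model's tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next. Nothing here looks at sentence or table boundaries, which is exactly the weakness described above.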

Semantic chunking uses embedding similarity between consecutive sentences to find natural breakpoints — you split where the topic shifts. This produces better chunks for heterogeneous documents, but it’s slower at ingestion time.
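Here’s a rough sketch of the idea. The `toy_embed` function is a bag-of-words stand-in I’m using so the example runs without a model, and the 0.2 threshold is arbitrary; with a real embedding model you would pass its embedding call in and tune the threshold on your own documents.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent sentences are less similar than `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(prev, current) < threshold:
            chunks.append([])  # similarity dropped: treat it as a topic shift
        chunks[-1].append(sentence)
        prev = current
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The index stores document vectors.",
    "Document vectors are added to the index at ingestion.",
    "Billing runs on the first day of each month.",
    "Each month the billing job emails invoices.",
]
chunks = semantic_chunks(sentences)
```

On this input the two indexing sentences end up in one chunk and the two billing sentences in another. The ingestion cost comes from the embedding call per sentence, which is why this approach is slower than fixed-size splitting.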

Sentence splitting is a useful middle ground for conversational or FAQ-style content. Split on sentence boundaries, then group sentences into chunks until you hit a token limit.
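A minimal sketch of that grouping step, assuming a naive regex sentence splitter and whitespace word counts in place of real token counts (both are simplifications of mine):

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def group_sentences(sentences, max_tokens=512):
    """Pack whole sentences into chunks without exceeding `max_tokens` words."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "How do I reset my password? Open Settings. Choose Security and click Reset. Is there a fee? No."
chunks = group_sentences(split_sentences(text), max_tokens=8)
```

A sentence is never cut in half: if a single sentence is longer than the limit, it becomes its own oversized chunk. The regex will also mis-split abbreviations like "e.g.", so a proper sentence tokenizer is worth it for real content.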

What I settled on is a hybrid: sentence splitting as the base, with a sliding window of overlap, plus a post-processing pass that merges chunks that are too short (fewer than ~100 tokens) into their neighbors.

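```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap are measured in characters by default,
# not tokens; pass a token-based length_function to count tokens instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)

# `documents` is assumed to be a list of LangChain Document objects loaded earlier.
chunks = splitter.split_documents(documents)


# Merge orphan chunks
def merge_short_chunks(chunks, min_tokens=100):
    """Concatenate consecutive chunks until each has at least min_tokens words."""
    merged = []
    buffer = ""
    for chunk in chunks:
        buffer += " " + chunk.page_content
        # Whitespace word count is a rough stand-in for a token count.
        if len(buffer.split()) >= min_tokens:
            merged.append(buffer.strip())
            buffer = ""
    if buffer.strip():
        # Fold a short trailing remainder into the last chunk, if there is one.
        if merged:
            merged[-1] += " " + buffer.strip()
        else:
            merged.append(buffer.strip())
    return merged
```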

The overlap is non-negotiable. Without it, an answer that straddles the boundary between two adjacent chunks gets cut in half, and neither chunk on its own carries enough context to rank well or to answer the query.

Embedding Model Drift

This one bit me hard. After several months in production, I swapped out the embedding model for a newer, better one. Benchmarks looked great. I re-embedded the new documents going forward, but left the existing index alone to save time. The results got worse.

The problem is obvious in hindsight: query embeddings from model B are not meaningfully comparable to document embeddings from model A. Cosine similarity between vectors from different embedding spaces is essentially noise.
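You can see the effect with a toy. The two "models" below are fakes I made up for illustration: each hashes tokens into a fixed random-looking vector, with a different mapping per model name. Real embedding models are nothing like this internally, but they share the relevant property that their coordinate systems are unrelated.

```python
import hashlib
import math


def toy_model(name, dim=256):
    """Fake embedding model: each (model name, token) pair hashes to a fixed +/-1 vector."""
    def embed(text):
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.sha256(f"{name}:{token}".encode()).digest()
            for i in range(dim):
                # One bit of the hash per dimension decides the sign.
                bit = (digest[(i // 8) % len(digest)] >> (i % 8)) & 1
                vec[i] += 1.0 if bit else -1.0
        return vec
    return embed


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


model_a, model_b = toy_model("model-a"), toy_model("model-b")
text = "how do I rotate an API key"
same_space = cosine(model_a(text), model_a(text))   # identical text, same model
cross_space = cosine(model_a(text), model_b(text))  # identical text, different models
```

Within one model, identical text scores 1.0. Across the two models, the very same text scores close to zero, which is what a query embedded with the new model sees when it is compared against documents embedded with the old one.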

The rule is simple: when you change your embedding model, you must re-embed your entire corpus and rebuild the index from scratch. There’s no shortcut. Plan for this operationally — keep a record of which model version was used for each document, and build a re-indexing job you can run on demand.

```python
# Store model version with each document at ingestion time
def ingest_document(doc, vector_store, embedder, model_version: str):
    embedding = embedder.embed(doc.content)
    metadata = {
        **doc.metadata,
        "embedding_model_version": model_version,
    }
    vector_store.upsert(id=doc.id, vector=embedding, metadata=metadata)
```
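With the version stored, the on-demand re-indexing job can be sketched roughly like this. The in-memory `records` dict, its field names, and the toy embedder are all assumptions made for the example; a real job would page through your vector store’s API instead.

```python
def find_stale_ids(records, current_version):
    """Return IDs whose stored embedding came from a different model version."""
    return [
        doc_id
        for doc_id, record in records.items()
        if record["metadata"].get("embedding_model_version") != current_version
    ]


def reindex(records, embed, current_version):
    """Re-embed every stale record and return how many were updated."""
    stale = find_stale_ids(records, current_version)
    for doc_id in stale:
        record = records[doc_id]
        record["vector"] = embed(record["content"])
        record["metadata"]["embedding_model_version"] = current_version
    return len(stale)


# Toy in-memory store and embedder, for illustration only.
records = {
    "a": {"content": "hello", "vector": [1.0], "metadata": {"embedding_model_version": "v1"}},
    "b": {"content": "world!", "vector": [6.0], "metadata": {"embedding_model_version": "v2"}},
    "c": {"content": "hi", "vector": [0.5], "metadata": {}},
}
updated = reindex(records, embed=lambda text: [float(len(text))], current_version="v2")
```

Records with a missing version are treated as stale, which is the safe default. This sketch updates records in place to stay short; in a live system you would more likely write into a fresh index and switch traffic over once it is complete, so queries never hit a mix of old and new vectors.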

Vector Database Gotchas

The choice of similarity metric matters and is easy to get wrong. Many embedding models produce unit-normalized vectors, in which case cosine similarity and dot product produce identical rankings. That equivalence holds only if your vectors are actually normalized. If you’re using a model that doesn’t normalize by default, dot product scores will be dominated by vector magnitude rather than direction, and your retrieval quality will degrade in confusing ways.

Always check your embedding model’s documentation. When in doubt, normalize explicitly:

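```python
import numpy as np


def normalize(v: list[float]) -> list[float]:
    arr = np.array(v, dtype=float)
    norm = np.linalg.norm(arr)
    # A zero vector has no direction; dividing would silently produce NaNs.
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return (arr / norm).tolist()
```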
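To see why magnitude matters, here’s a small worked example with made-up 2-D vectors, written in plain Python so it runs anywhere. One document points almost exactly along the query but is short; the other is 45 degrees off but long.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
aligned_short = [0.9, 0.1]  # nearly the same direction as the query, small magnitude
off_axis_long = [5.0, 5.0]  # 45 degrees away from the query, large magnitude
docs = [aligned_short, off_axis_long]

dot_winner = max(docs, key=lambda d: dot(query, d))
cosine_winner = max(docs, key=lambda d: cosine(query, d))
```

Dot product picks the long, off-axis vector; cosine picks the aligned one. After normalizing both documents, the two metrics agree.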

Index tuning is another area where defaults will hurt you at scale. Many hosted vector DBs use HNSW indexes with default parameters that work fine for small collections. Once you’re past a few million vectors, you need to tune the build-time accuracy/speed tradeoff (commonly exposed as ef_construction or a similarly named setting) and the number of neighbors per node (commonly M). Profile this with your actual data distribution, not a benchmark.
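One way to profile is to measure recall against brute force on a sample of real queries: run the exact search, run the approximate index, and compare. The sketch below shows the measurement only; the `approx` list is a made-up placeholder for whatever your index returns, and the function names are mine.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, vectors, k):
    """Brute-force top-k by cosine similarity: slow, but exact."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
query = [1.0, 0.0]
exact = exact_top_k(query, vectors, k=2)
approx = ["a", "c"]  # placeholder for the IDs your ANN index returned
recall = recall_at_k(approx, exact)
```

Averaging this over a few hundred sampled production queries, at each index setting you try, gives you a recall-versus-latency curve for your own data.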

Filtering is deceptively expensive. If you’re filtering by metadata at query time (e.g., only retrieve documents from a specific tenant or date range), the filter is often applied post-retrieval on the raw candidates. That means you need to over-fetch — retrieving top-50 to get a reliable top-5 after filtering. Some vector DBs support pre-filter indexes, which is a significant quality-of-life improvement for filtered workloads.
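The over-fetch pattern can be sketched as a loop that widens the fetch until enough results survive the filter. The `search` callable and the record shape here are hypothetical; substitute your vector store’s query call.

```python
def filtered_top_k(search, predicate, k=5, initial_fetch=50, max_fetch=1000):
    """Post-filter ranked results, doubling the fetch size until k matches are found."""
    fetch = initial_fetch
    while True:
        candidates = search(fetch)  # ranked best-first, at most `fetch` items
        matches = [c for c in candidates if predicate(c)]
        exhausted = len(candidates) < fetch  # the index had nothing more to return
        if len(matches) >= k or exhausted or fetch >= max_fetch:
            return matches[:k]
        fetch = min(fetch * 2, max_fetch)


# Toy ranked results: every tenth hit belongs to the tenant we care about.
ranked = [{"id": i, "tenant": "acme" if i % 10 == 0 else "other"} for i in range(200)]
results = filtered_top_k(
    search=lambda n: ranked[:n],
    predicate=lambda r: r["tenant"] == "acme",
    k=5,
    initial_fetch=20,
)
```

Here the first fetch of 20 yields only two matches, so the loop widens to 40 and then 80 before it can return five. Each retry is another round trip, which is the cost that pre-filtering in the index avoids.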

Observability: Knowing When RAG Is Failing

The hardest part of operating a RAG system is that bad retrieval often looks like a model problem from the outside. A user says “the AI gave me the wrong answer,” but you don’t know whether the retrieval returned the wrong chunks, the right chunks got ignored, or the LLM hallucinated despite having the right context.

I instrument at three layers:

- Retrieval layer: log the query, the retrieved chunk IDs and scores, and the top chunk text. If scores are consistently low (below ~0.70 cosine for most models), you have a retrieval problem, not a generation problem.
- Context window: log what actually got passed into the LLM prompt. Sometimes chunks get dropped due to token limits in ways that aren’t obvious.
- Generation layer: use an LLM-as-judge to periodically score whether the answer is grounded in the retrieved context.

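```python
import logging

logger = logging.getLogger(__name__)


def log_retrieval_event(query, chunks, scores):
    # An empty result set is the clearest low-confidence signal there is.
    if not scores:
        logger.info({
            "event": "rag_retrieval",
            "query": query,
            "chunk_count": 0,
            "low_confidence": True,
        })
        return
    logger.info({
        "event": "rag_retrieval",
        "query": query,
        "top_score": max(scores),
        "mean_score": sum(scores) / len(scores),
        "chunk_count": len(chunks),
        "low_confidence": max(scores) < 0.70,
    })
```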
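For the context-window layer, a rough sketch of what I mean: record which chunks made it into the prompt and which were dropped by the token budget. The function name, the `(chunk_id, text)` tuple shape, and the whitespace token count are assumptions for the example.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Add chunks in ranked order, skipping any that no longer fit the budget."""
    kept, dropped, used = [], [], 0
    for chunk_id, text in chunks:
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            kept.append(chunk_id)
            used += cost
        else:
            dropped.append(chunk_id)
    # Return a log-ready event so dropped chunks are visible after the fact.
    return {"event": "rag_context", "kept_ids": kept, "dropped_ids": dropped, "tokens_used": used}


ranked_chunks = [("c1", "a b c d"), ("c2", "e f g h i"), ("c3", "j k")]
event = pack_context(ranked_chunks, max_tokens=7)
```

In this run the second-ranked chunk is silently skipped while the third still fits. Without the `dropped_ids` field in your logs, that kind of omission is invisible when you later try to explain a bad answer.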

A rising rate of low-confidence retrievals is your early warning system. It usually means your index is stale, your chunking doesn’t match your query distribution, or a recent document ingest introduced noise.
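Tracking that rate can be as simple as a rolling window over recent queries. This is a minimal sketch; the class name, window size, and alert threshold are placeholder choices of mine, and in practice you would likely compute the same thing in your metrics system.

```python
from collections import deque


class LowConfidenceMonitor:
    """Track the share of low-confidence retrievals over the last `window` queries."""

    def __init__(self, window=500, score_threshold=0.70, alert_rate=0.2):
        self.flags = deque(maxlen=window)
        self.score_threshold = score_threshold
        self.alert_rate = alert_rate

    def record(self, top_score):
        self.flags.append(top_score < self.score_threshold)

    @property
    def rate(self):
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def should_alert(self):
        # Wait for a full window so a few early misses don't trigger an alert.
        return len(self.flags) == self.flags.maxlen and self.rate >= self.alert_rate


monitor = LowConfidenceMonitor(window=4, alert_rate=0.5)
for score in [0.9, 0.6, 0.65, 0.8]:
    monitor.record(score)
rate_after_four = monitor.rate
alert_after_four = monitor.should_alert()

for score in [0.95, 0.92]:
    monitor.record(score)
rate_after_six = monitor.rate
alert_after_six = monitor.should_alert()
```

Two of the first four queries fall below the threshold, so the rate hits 0.5 and the monitor alerts; once two healthy queries push the older ones out of the window, the rate falls to 0.25 and the alert clears.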

Lessons Learned

After running this in production for a while, here’s the short version:

- Re-embed everything when you change embedding models. No exceptions.
- Tune overlap aggressively; most people use too little.
- Normalize your vectors and pick the right similarity metric for your model.
- Log retrieval scores from day one. You can’t debug what you can’t see.
- Treat retrieval and generation as separate problems. Debug them independently before blaming the LLM.

RAG in production is an infrastructure problem as much as an ML problem. The teams who succeed are the ones who instrument early, plan for re-indexing, and build evaluation into the deployment pipeline — not as an afterthought.