RAG (retrieval-augmented generation) is the pattern where you give an LLM context from your own documents instead of hoping it memorized the right facts. Chunk your documents, embed the chunks, retrieve the relevant ones, and let the model generate answers grounded in your data.
The concept is simple. Shipping a system that works is not.
Most RAG tutorials make it look easy: embed some PDFs, wire up a vector database, prompt your model. But the gap between demo and production? That's where teams lose months. Research from IEEE's CAIN 2024 conference identified seven distinct failure points in RAG systems: missing content (the answer isn't in your documents), missed rankings (the answer exists but retrieval didn't surface it), context consolidation limits (too many results for the model to synthesize), extraction failures (model has context but can't pull out the answer), wrong format, incorrect specificity, and incomplete answers despite complete context.
The uncomfortable finding: RAG robustness evolves rather than being designed in.
You can't fully validate the system until you're running it against real queries. The researchers recommend accepting this and building for iterative improvement, not upfront perfection.
Chunking Determines Everything Downstream
Get chunking wrong and no amount of model tuning will save you.
The core tension: smaller chunks match queries more precisely but lose surrounding context. Larger chunks preserve relationships but dilute the relevance signal that helps retrieval. Weaviate's engineering team benchmarked nine different chunking strategies, from naive fixed-size to LLM-powered semantic chunking. Their recommended baseline for fixed-size chunking: 512 tokens with 50-100 token overlap. The overlap matters; it prevents information from getting lost at chunk boundaries.
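That baseline is a few lines of code. A minimal sketch, operating on a plain token list (a real pipeline would tokenize with the same tokenizer as the embedding model):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size chunks.

    Consecutive chunks share `overlap` tokens, so a fact that
    straddles a boundary survives intact in at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# 1,200 tokens -> three chunks, each sharing 64 tokens with the next
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=512, overlap=64)
```

The overlap parameter is where boundary losses get prevented: without it, a sentence split across chunks 1 and 2 exists in neither retrievable unit.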
But fixed-size chunking ignores document structure entirely. Tables get split mid-row. Paragraphs break mid-thought.
Semantic chunking fixes this by using natural language boundaries to create coherent chunks. According to studies cited in Weaviate's analysis, semantic chunking achieves faithfulness scores of 0.79-0.82 compared to 0.47-0.51 for naive fixed-size approaches. That's a 60% improvement in faithfulness, not a marginal gain. IBM's enterprise RAG guidance adds another consideration: agentic chunking that uses intelligent sizing and overlap to prevent mid-table and mid-sentence splits. This matters more than most teams realize, especially for structured data like financial tables or technical specifications.
Our read: start with 512-token fixed chunks as a baseline, then layer semantic chunking once you have a working pipeline. Don't skip to fancy chunking strategies until the simple one fails in measurable ways.
Retrieval Has Systematic Blind Spots
Vector similarity search is the default, but it misses things. Dense embeddings can fail to bridge vocabulary gaps: the query "XGBoost hyperparameter tuning" might not retrieve a document about "gradient boosting parameter optimization," even though they describe the same thing. And they dilute the exact-term matches (error codes, product names, rare identifiers) that keyword search handles trivially.
Hybrid search solves this by combining two retrieval methods: BM25 (traditional keyword matching based on term frequency) and dense embeddings (semantic similarity). The technical approach uses Reciprocal Rank Fusion to merge results from both systems: each candidate gets a score based on its position in each ranking, and the combined scores determine final ordering. This sounds complex, but most vector databases now offer hybrid search as a built-in feature. If you're not using it, you're leaving precision on the table.
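RRF itself is only a few lines. A minimal sketch (k=60 is the conventional constant from the original RRF formulation; it damps the dominance of top ranks):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked doc-id lists with Reciprocal Rank Fusion.

    Each document earns 1/(k + rank) per list it appears in;
    documents ranked well by both systems rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # keyword ranking
dense_hits = ["d1", "d4", "d3"]  # embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# d1 wins: it ranked high in both lists
```

Note that RRF needs only rank positions, not raw scores, which is exactly why it works for fusing BM25 and cosine similarity despite their incomparable scales.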
We covered embedding limitations in depth in How Embeddings Work (and When They Break). The short version: cosine similarity has systematic biases, embedding drift is real, and model migrations break indexes. RAG inherits all of these problems.
Initial retrieval casts a wide net. You might retrieve 50-200 candidates using fast approximate methods, but many will be marginally relevant at best. Reranking uses more expensive models to sort these by actual relevance to the query, and it's where good RAG systems separate from great ones.
Cross-encoders like MiniLM (trained on the MS MARCO dataset) score query-passage pairs directly. They're more accurate than embedding similarity because they see query and document together, but they're slow. You can't run a cross-encoder against your entire index; you rerank a shortlist. Late interaction models like ColBERT encode query and document separately, then compute token-level similarity. More efficient than cross-encoders while still capturing query-document relationships that bi-encoder retrieval misses.
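The shape of the rerank step is the same regardless of which scorer you plug in. A sketch where `score_fn` stands in for a cross-encoder (the toy word-overlap scorer below is for illustration only, not a real relevance model):

```python
def rerank(query, shortlist, score_fn, top_k=3):
    """Rerank a retrieval shortlist with an expensive pairwise
    scorer. score_fn(query, passage) stands in for a cross-encoder
    that reads query and passage together -- too slow for the full
    index, fine for a shortlist of 50-200 candidates."""
    scored = sorted(shortlist, key=lambda p: score_fn(query, p), reverse=True)
    return scored[:top_k]

# Toy scorer for illustration only: word overlap with the query.
def overlap_score(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))

shortlist = [
    "cooking pasta at home",
    "how to tune xgboost parameters well",
    "xgboost parameters explained",
]
top = rerank("tune xgboost parameters", shortlist, overlap_score, top_k=2)
```

The design point: retrieval and reranking are deliberately decoupled, so you can swap a cheap scorer for a cross-encoder without touching the index.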
Both approaches improve NDCG and MRR metrics significantly. If you're building production RAG and skipping reranking, you're optimizing the wrong part of the pipeline.
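MRR in particular is cheap to compute once you have a labeled query set. A minimal version:

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """MRR: average over queries of 1/rank of the first relevant
    result (contributing 0 when nothing relevant is retrieved)."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two queries: first relevant hits at ranks 2 and 1 -> MRR 0.75
mrr = mean_reciprocal_rank(
    [["d1", "d2", "d3"], ["d4", "d5"]],
    [{"d2"}, {"d4"}],
)
```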
Scale Introduces Problems the Tutorials Skip
Context window limits mean you can't just stuff 200 retrieved chunks into a prompt. You need consolidation strategies: summarizing chunks, selecting the most relevant, or structuring multi-turn conversations where the model asks for more context.
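The simplest consolidation strategy is greedy selection under a token budget: take the highest-scored chunks that fit, drop (or summarize) the rest. A sketch, assuming retrieval hands you scored chunks with known token counts:

```python
def fit_budget(candidates, budget=3000):
    """Greedy consolidation: keep the highest-scored chunks that
    still fit the remaining token budget.

    candidates: (score, n_tokens, text) tuples from retrieval
    or reranking.
    """
    selected, used = [], 0
    for score, n_tokens, text in sorted(candidates, reverse=True):
        if used + n_tokens <= budget:
            selected.append(text)
            used += n_tokens
    return selected

candidates = [
    (0.9, 2000, "chunk A"),
    (0.8, 1500, "chunk B"),  # skipped: would overflow the budget
    (0.7, 800, "chunk C"),
]
context = fit_budget(candidates, budget=3000)
```

Note the failure mode this sketch accepts: chunk B is more relevant than chunk C but gets skipped because it doesn't fit, which is exactly the consolidation limit the CAIN researchers flagged.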
IBM notes that complex entity relationships break naive RAG. If your documents describe relationships between entities (org charts, supply chains, regulatory hierarchies), you may need GraphRAG, which adds a knowledge graph layer to capture relationships that flat document retrieval misses.
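A toy example of why flat retrieval struggles here (the org-chart schema is invented for illustration): each reporting edge lives in a separate chunk, so answering "who is in Alice's chain of command?" requires multi-hop traversal that single-chunk retrieval rarely delivers in one shot.

```python
# Toy org chart. Flat retrieval sees each edge as an isolated
# chunk; a graph layer can traverse the whole chain.
reports_to = {"alice": "bob", "bob": "carol"}

def chain_of_command(person, edges):
    """Follow reporting edges upward across multiple hops."""
    chain = [person]
    while chain[-1] in edges:
        chain.append(edges[chain[-1]])
    return chain

chain = chain_of_command("alice", reports_to)  # alice -> bob -> carol
```

GraphRAG systems do this at scale, extracting entities and relationships at indexing time and querying the resulting graph alongside the vector index.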
And then there's security: the sleeper problem. Enterprise data has access controls. When you chunk and embed documents, you need to preserve those controls in retrieval. Otherwise your RAG system becomes a data leak; users query for information they're not authorized to see, and the system happily retrieves it.
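The fix is mechanical once the metadata exists: carry each document's ACL onto its chunks at indexing time, then filter after retrieval and before anything reaches the prompt. A sketch, assuming an illustrative `allowed_groups` metadata field:

```python
def authorized(results, user_groups):
    """Drop retrieved chunks the querying user may not see.

    Assumes each chunk carries an `allowed_groups` field copied
    from its source document's ACL at indexing time (an
    illustrative schema, not a specific product's API).
    """
    return [r for r in results if user_groups & set(r["allowed_groups"])]

results = [
    {"text": "Q3 revenue forecast", "allowed_groups": ["finance"]},
    {"text": "Public product FAQ", "allowed_groups": ["everyone"]},
]
visible = authorized(results, {"everyone", "engineering"})
```

Filtering can also happen at query time inside the vector database (metadata filters), which is preferable at scale; the invariant is the same either way: no chunk reaches the prompt unless the user could have opened the source document.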
The Pattern That Ships
- Start with fixed chunking (512 tokens, 50-100 overlap) and validate retrieval quality with real queries
- Add hybrid search (BM25 + embeddings) to catch keyword matches that pure semantic search misses
- Implement reranking once retrieval volume justifies the latency cost
- Layer semantic chunking for document types where structure matters (tables, code, legal text)
- Build evaluation loops: track retrieval quality, generation quality, and user feedback systematically
The research is clear on one point: you cannot design RAG robustness upfront. It emerges from running the system against real queries and fixing what breaks. Teams that accept this build better systems than teams chasing architectural perfection before launch.