Beyond Naive RAG
Retrieval-Augmented Generation has become the standard pattern for grounding LLM responses in factual data. But the gap between a tutorial RAG demo and a production enterprise system is enormous. Naive RAG — embed documents, find nearest vectors, stuff into prompt — works for demos but fails in enterprise settings due to poor recall, hallucinations from irrelevant context, and inability to handle complex queries.
This article walks through the architecture patterns that bridge that gap, from foundational improvements to advanced techniques we deploy at MBB AI Studio.
Pattern 1: Chunking Strategy
How you split documents matters more than which embedding model you use. Common chunking mistakes include:
- Fixed-size chunks with arbitrary boundaries that split sentences and lose context
- Chunks that are too small (< 100 tokens) losing semantic meaning
- Chunks that are too large (> 1000 tokens) diluting the relevant information
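The first mistake is avoidable even without a library. As a minimal sketch (the function name and the word-count token proxy are our own, not a specific library's API), a chunker can pack whole sentences into a token budget instead of cutting at fixed offsets:

```python
def chunk_by_sentence(text: str, max_tokens: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks up to a token budget,
    so no chunk ever splits mid-sentence (tokens approximated by words)."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk before it would exceed the budget
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A production version would use a real tokenizer and sentence splitter, but the boundary-respecting logic is the same.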
Effective chunking strategies for enterprise documents:
Semantic chunking — Split based on topic shifts detected by embedding similarity between consecutive paragraphs:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(documents)
```

Document-structure-aware chunking — Respect document hierarchy (headers, sections, lists). For PDFs, use layout-aware parsers like Unstructured or Docling that preserve heading structure:
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    chunking_strategy="by_title",
    max_characters=1500,
    combine_text_under_n_chars=200,
)
```

Parent-child chunking — Index small chunks for precise retrieval but return the parent (larger) chunk for context. This gives you the best of both worlds: high-precision matching with sufficient context for generation.
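The parent-child idea fits in a few lines of dependency-free Python. This is an illustrative sketch with our own names and a toy word-overlap scorer standing in for a real vector search:

```python
def build_parent_child_index(parents: list[str], child_size: int = 50):
    """Split each parent chunk into small child chunks and record
    which parent each child came from, as (child_text, parent_id)."""
    child_index = []
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            child = " ".join(words[i:i + child_size])
            child_index.append((child, parent_id))
    return child_index

def retrieve_parent(query: str, child_index, parents: list[str]) -> str:
    """Match against small children, but return the full parent chunk."""
    # Toy relevance scorer: word overlap between query and child
    def score(child: str) -> int:
        return len(set(query.lower().split()) & set(child.lower().split()))
    best_child, parent_id = max(child_index, key=lambda pair: score(pair[0]))
    return parents[parent_id]  # full parent gives the LLM enough context
```

LangChain packages this pattern as `ParentDocumentRetriever`, which pairs a vector store of child chunks with a docstore of parent documents.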
Pattern 2: Hybrid Retrieval
Vector similarity alone misses exact keyword matches, acronyms, and domain-specific terminology. Hybrid retrieval combines dense (vector) and sparse (keyword) search:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Dense retriever (vector similarity)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse retriever (BM25 keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents, k=10)

# Combine with reciprocal rank fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Tune based on your data
)
```

In our experience, hybrid retrieval improves recall by 15-30% over pure vector search for enterprise knowledge bases, especially for technical documentation with domain-specific jargon.
Pattern 3: Query Transformation
User queries are often vague, ambiguous, or poorly structured. Transform them before retrieval:
Query expansion — Generate multiple reformulations of the query to cast a wider retrieval net:
```python
def expand_query(original_query: str) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this question
that might match relevant documents differently:

Original: {original_query}

Return only the 3 alternatives, one per line."""
    result = llm.invoke(prompt)
    alternatives = result.content.strip().split("\n")
    return [original_query] + alternatives
```

HyDE (Hypothetical Document Embeddings) — Generate a hypothetical answer, embed it, and use that embedding for retrieval. This works because a hypothetical answer is semantically closer to the actual document than the question itself:
```python
def hyde_retrieval(query: str) -> list[Document]:
    # Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a brief paragraph that would answer: {query}"
    )
    # Embed the hypothetical answer
    hyde_embedding = embeddings.embed_query(hypothetical.content)
    # Search with the hypothetical embedding
    return vectorstore.similarity_search_by_vector(hyde_embedding, k=10)
```

Step-back prompting — For specific questions, generate a broader question first:
- Original: "What is the maximum batch size for model X on A100?"
- Step-back: "What are the performance characteristics and configuration options for model X?"
The broader question retrieves more relevant context that likely contains the specific answer.
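As a minimal sketch of step-back retrieval, with the LLM client and retriever injected as plain callables (hypothetical interfaces, not a specific library API), one approach is to search with both the broad and the original query and merge the results:

```python
def step_back_retrieval(query: str, llm, retriever) -> list:
    """Retrieve using a broader 'step-back' reformulation plus the original.

    `llm` is a callable prompt -> str; `retriever` is a callable
    query -> list of docs. Both are stand-ins for real components.
    """
    step_back = llm(
        "Rewrite this specific question as a broader, more general "
        f"question about the same topic:\n\n{query}"
    )
    # Search with both forms, then deduplicate while preserving order
    results = retriever(step_back) + retriever(query)
    seen, unique = set(), []
    for doc in results:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique
```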
Pattern 4: Re-Ranking
Initial retrieval casts a wide net (top 20-50 results). A re-ranker then scores each result's relevance to the specific query and returns the top-k:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Broad initial retrieval
    candidates = hybrid_retriever.invoke(query)  # top 30
    # Score each candidate
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)
    # Return top-k by reranker score
    scored_docs = sorted(
        zip(candidates, scores), key=lambda x: x[1], reverse=True
    )
    return [doc for doc, score in scored_docs[:k]]
```

Cross-encoder re-rankers are significantly more accurate than bi-encoder similarity for relevance scoring because they process the query and document jointly. The trade-off is speed — re-ranking 30 candidates takes 50-100ms vs. 5ms for vector search. This is why we use a two-stage pipeline: fast retrieval followed by accurate re-ranking.
Pattern 5: Contextual Compression
Even after re-ranking, retrieved chunks may contain irrelevant information that wastes prompt tokens and confuses the LLM. Contextual compression extracts only the relevant portions:
```python
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

def compressed_retrieval(query: str) -> list[Document]:
    docs = retrieve_and_rerank(query, k=5)
    compressed = []
    for doc in docs:
        result = compressor.compress_documents([doc], query)
        if result:
            compressed.extend(result)
    return compressed
```

This reduces token usage by 40-60% while maintaining answer quality — a significant cost savings at scale.
Pattern 6: Multi-Index Architecture
Enterprise knowledge bases span multiple data sources with different structures. Instead of one monolithic index, use specialized indexes:
| Index | Data Source | Embedding Model | Chunk Size |
|---|---|---|---|
| docs | Technical documentation | text-embedding-3-large | 800 tokens |
| code | Code repositories | code-specific embeddings | Function-level |
| tickets | Support tickets | text-embedding-3-small | Full ticket |
| policies | Policy documents | text-embedding-3-large | Section-level |
A routing layer classifies the incoming query and selects the appropriate index(es):
```python
VALID_CATEGORIES = {"docs", "code", "tickets", "policies"}

def route_query(query: str) -> list[str]:
    classification = llm.invoke(
        f"Classify this query into one or more categories: "
        f"docs, code, tickets, policies\n\nQuery: {query}"
    )
    # Keep only recognized category names from the model's reply
    matched = [c for c in sorted(VALID_CATEGORIES)
               if c in classification.content.lower()]
    return matched or ["docs"]  # fall back to the general docs index
```

This approach dramatically improves precision by searching only relevant knowledge domains.
Pattern 7: Evaluation and Continuous Improvement
You can't improve what you can't measure. Build an evaluation pipeline:
Retrieval metrics:
- Recall@k — What percentage of relevant documents are in the top-k results?
- MRR (Mean Reciprocal Rank) — How high is the first relevant result ranked?
- NDCG — Are the most relevant results ranked highest?
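The first two retrieval metrics are straightforward to compute yourself; a sketch (function names are ours):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

In practice you average these over the whole evaluation set; libraries like ranx or ragas compute them (and NDCG) for you.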
End-to-end metrics:
- Answer correctness — Scored by an LLM judge against ground-truth answers
- Faithfulness — Does the answer only use information from the retrieved context?
- Answer relevance — Does the answer actually address the question asked?
Build an evaluation dataset of 100-200 query-answer pairs manually curated by domain experts. Run evaluations on every pipeline change:
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_correctness, context_recall],
)
print(results)
```

Putting It All Together
A production RAG pipeline combines these patterns into a multi-stage pipeline:
1. Query transformation — Expand/reformulate the query
2. Hybrid retrieval — Dense + sparse search across routed indexes
3. Re-ranking — Cross-encoder scoring of candidates
4. Contextual compression — Extract relevant portions
5. Generation — LLM produces an answer with source citations
6. Post-processing — Validate citations, check for hallucination markers
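The six stages compose naturally into a single function. In this sketch each stage is an injected callable standing in for the components built above (the signature is illustrative, not prescriptive):

```python
def answer_query(query: str, *, transform, retrieve, rerank,
                 compress, generate, postprocess) -> str:
    """Run the six-stage RAG pipeline; each argument is a stage callable."""
    queries = transform(query)                 # 1. query transformation
    candidates = []
    for q in queries:                          # 2. hybrid retrieval
        candidates.extend(retrieve(q))
    ranked = rerank(query, candidates)         # 3. re-ranking
    context = compress(query, ranked)          # 4. contextual compression
    draft = generate(query, context)           # 5. generation with citations
    return postprocess(draft, context)         # 6. post-processing
```

Keeping the stages as injected callables makes each one independently swappable and testable, which matters when you are measuring the impact of every pipeline change.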
Conclusion
Enterprise RAG is not a single technique — it's an architecture. Each pattern addresses a specific failure mode of naive RAG: poor chunking causes lost context, single-mode retrieval misses keywords, unranked results dilute quality, and lack of evaluation means silent degradation. At MBB AI Studio, we implement these patterns incrementally with clients, measuring improvement at each stage. Start with hybrid retrieval and re-ranking — these two changes alone typically improve answer quality by 30-40% over naive RAG.