Article #17 · LLM Engineering · 2026

RAG Tutorial 2026.
Production Grade. LangChain + ChromaDB.

This tutorial shows you how to build a production-ready retrieval-augmented chatbot that actually survives contact with real users, using LangChain, ChromaDB, and patterns that work at scale.

April 2026 · ~25 min read · Python 3.11+

// 01 What RAG actually is, and what it isn't

Let's be direct: most RAG tutorials online show you how to ask a question about a single PDF. That's not RAG in production. That's a weekend demo.

Retrieval-Augmented Generation is the pattern of giving an LLM access to an external knowledge base at query time. Instead of relying purely on training-time knowledge, you fetch relevant context dynamically and include it in the prompt. The model answers based on what you gave it, not just what it memorized.
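The whole pattern fits in a few lines of sketch code (conceptual only: embed, vector_search, and llm are placeholders here, not real APIs):

rag_sketch.py
def answer(query: str) -> str:
    # 1. Embed the question into the same vector space as the documents
    query_vector = embed(query)
    # 2. Fetch the k most semantically similar chunks from the knowledge base
    chunks = vector_search(query_vector, k=5)
    # 3. Stuff the retrieved context into the prompt; the LLM answers from it
    prompt = f"Context:\n{chunks}\n\nQuestion: {query}"
    return llm(prompt)

Everything that follows in this tutorial is a production-hardened version of those four steps.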

Why does this actually matter in production? Because your knowledge base changes faster than any retraining cycle, because you can show users which sources an answer came from, and because you control exactly what the model sees at query time.

⚠ common misconception
RAG is not a replacement for fine-tuning. Fine-tuning changes how the model reasons and responds. RAG gives it access to new facts. They serve different goals and often work best together.

// 02 Architecture overview

Before touching any code, here's the complete pipeline in two phases. The key insight: ingestion and querying are separate concerns. Keep them that way in your codebase.

// Ingestion pipeline (runs offline / on schedule)
Raw Docs → Chunker → Embedder → ChromaDB
// Query pipeline (runs at request time)
User Query → Embed Query → Vector Search → LLM → Answer
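One way to enforce that separation on disk (a suggested layout, nothing more):

rag-app/
├── ingest.py      # ingestion: load → chunk → embed → store
├── retriever.py   # query: retrieve → prompt → answer
├── server.py      # FastAPI app wrapping the query pipeline
└── chroma_db/     # persisted vector store (built by ingest.py)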

// 03 Project setup & dependencies

terminal
# Create and activate env
python -m venv .venv
source .venv/bin/activate

# Install everything we need
pip install langchain langchain-openai langchain-community \
            chromadb openai tiktoken \
            pypdf sentence-transformers \
            python-dotenv fastapi uvicorn
.env
OPENAI_API_KEY=sk-...
CHROMA_PERSIST_DIR=./chroma_db
EMBED_MODEL=text-embedding-3-small
LLM_MODEL=gpt-4o-mini
💡 model choice
text-embedding-3-small is the sweet spot for embeddings in 2026: cheap, fast, and accurate. For the LLM, gpt-4o-mini gives great quality-to-cost. Swap in Claude or Gemini if you prefer; LangChain abstracts the backend cleanly, as the sketch below shows.
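For instance, switching the chat model to Claude is a two-line change (a sketch; it assumes pip install langchain-anthropic, an ANTHROPIC_API_KEY in your .env, and a current Claude model id):

swap_backend.py
from langchain_anthropic import ChatAnthropic

# Drop-in replacement for ChatOpenAI; the rest of the chain is untouched
llm = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0.1)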

// 04 Document ingestion pipeline

ingest.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

load_dotenv()

def build_vector_store(docs_dir="./data/docs"):
    # Load every PDF under docs_dir, recursively
    loader = DirectoryLoader(docs_dir, glob="**/*.pdf", loader_cls=PyPDFLoader)
    docs = loader.load()

    # 800 chars with 120-char overlap (~15%); see section 05 for why
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
    chunks = splitter.split_documents(docs)

    # Embed and persist; read config from .env instead of hardcoding it
    embeddings = OpenAIEmbeddings(model=os.getenv("EMBED_MODEL", "text-embedding-3-small"))
    vectordb = Chroma.from_documents(
        chunks,
        embeddings,
        persist_directory=os.getenv("CHROMA_PERSIST_DIR", "./chroma_db"),
    )
    return vectordb

if __name__ == "__main__":
    build_vector_store()
    print("Ingestion complete.")

// 05 Chunking strategies that actually matter

This is where most tutorials cut corners, and where most production RAG pipelines silently fail. Chunking isn't just "split every N characters." How you split determines whether retrieval returns useful context or incoherent garbage.

A RAG pipeline is only as good as the chunks it retrieves. Bad chunking means the right information is split across boundaries the model can never bridge.

Strategy            | Best For                     | Gotcha
Fixed-size          | Uniform data, fast indexing  | Breaks mid-sentence constantly
Recursive character | General purpose              | Not structure-aware
Semantic            | Long articles, papers        | Slower; embeds at split time

For most production use cases, recursive character splitting with 800 chars and ~15% overlap is where to start.
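To see why the recursive splitter earns its "general purpose" label, look at what it actually does (a short sketch; the separator list shown is the splitter's default order, spelled out explicitly):

chunking_demo.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries each separator in order and only falls through to the next when a
# piece is still too large: paragraphs, then lines, then words, then chars.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", " ", ""],  # the default order
)
with open("doc.txt") as f:
    chunks = splitter.split_text(f.read())
print(len(chunks), "chunks; first 80 chars:", chunks[0][:80])

That fallthrough is what keeps most chunk boundaries on paragraph or sentence edges instead of mid-word.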

📌 rule of thumb
Your chunk size should be at least as large as the "useful unit of information" in your documents: if a complete answer gets split across two chunks, no retriever can reassemble it. For FAQs, that unit is one Q&A pair; for API docs, one endpoint.

// 06 Embeddings and ChromaDB

An embedding is a high-dimensional vector representing the semantic meaning of text. Two chunks about the same concept should have vectors pointing in roughly the same direction, even if they share no words in common.
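You can check the "same direction" claim yourself in a few lines (a sketch using the embedding model from our stack; numpy ships as a dependency of chromadb):

similarity_check.py
import numpy as np
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")
a, b, c = emb.embed_documents([
    "How do I reset my password?",
    "Steps to recover account credentials",   # same concept, zero shared words
    "Quarterly revenue grew twelve percent",  # unrelated topic
])

def cosine(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # high: semantically close
print(cosine(a, c))  # low: different topic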

ChromaDB is our vector store. It runs in-process, persists to disk, and has a clean Python API. Right call for most teams building their first production RAG. When you scale past ~500k chunks, look at Pinecone, Weaviate, or pgvector.
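If you want to feel how small Chroma's API is before wiring in LangChain, here's the raw client (a sketch; the collection name and path are arbitrary, and Chroma's default embedding function downloads a small local model on first use):

chroma_demo.py
import chromadb

client = chromadb.PersistentClient(path="./chroma_db_demo")  # persists to disk
collection = client.get_or_create_collection("demo")

collection.add(
    ids=["c1", "c2"],
    documents=[
        "RAG fetches external context at query time.",
        "Fine-tuning changes how a model responds.",
    ],
)

# The query text is embedded with the same default function,
# and the nearest stored documents come back ranked
results = collection.query(query_texts=["what does RAG do?"], n_results=1)
print(results["documents"])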

// 07 Retrieval chain + FastAPI server

retriever.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()

def build_rag_chain():
    # Re-open the store that ingest.py persisted (same embedding model!)
    embeddings = OpenAIEmbeddings(model=os.getenv("EMBED_MODEL", "text-embedding-3-small"))
    vectordb = Chroma(
        persist_directory=os.getenv("CHROMA_PERSIST_DIR", "./chroma_db"),
        embedding_function=embeddings,
    )
    # MMR: fetch 20 candidates, keep the 5 most diverse
    retriever = vectordb.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20},
    )
    llm = ChatOpenAI(model=os.getenv("LLM_MODEL", "gpt-4o-mini"), temperature=0.1)

    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using ONLY the context below.\n\n{context}"),
        ("human", "{input}"),
    ])

    qa_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, qa_chain)
📌 why MMR matters
Default similarity search returns near-duplicate chunks. MMR trades a little similarity for diversity, giving broader coverage and noticeably better answers.
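With the chain built, the FastAPI layer on top is thin (a minimal sketch; the file name server.py and the /ask route are my choices, not a fixed convention):

server.py
from fastapi import FastAPI
from pydantic import BaseModel
from retriever import build_rag_chain

app = FastAPI()
chain = build_rag_chain()  # build once at startup, reuse across requests

class Question(BaseModel):
    question: str

@app.post("/ask")
def ask(q: Question):
    # create_retrieval_chain expects {"input": ...} and returns the
    # generated text under "answer" plus the retrieved docs under "context"
    result = chain.invoke({"input": q.question})
    return {
        "answer": result["answer"],
        "sources": [doc.metadata for doc in result["context"]],
    }

Run it with uvicorn server:app --reload and POST a JSON body like {"question": "..."} to /ask.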

// 08 Evaluating your RAG pipeline

Two metrics matter most in practice: retrieval quality (did the right chunks come back for the query?) and faithfulness (did the answer stay grounded in those chunks, or did the model improvise?).

For a rigorous setup, use RAGAS โ€” a Python library for evaluating faithfulness, answer relevancy, and context recall using LLM-based judges.
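A minimal sketch, assuming ragas 0.1.x and the datasets package (the column names below are what that version expects; check the docs for your installed release):

eval_rag.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One row per test question, answered by your chain ahead of time
eval_set = Dataset.from_dict({
    "question":     ["What does MMR retrieval do?"],
    "answer":       ["It trades a little similarity for diversity."],
    "contexts":     [["MMR trades a little similarity for diversity."]],
    "ground_truth": ["MMR re-ranks candidates to reduce near-duplicates."],
})

scores = evaluate(eval_set, metrics=[faithfulness, answer_relevancy, context_recall])
print(scores)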

// 09 Production checklist

- Keep ingestion and query pipelines in separate code paths (section 02).
- Persist the vector store and re-run ingestion on a schedule, not ad hoc.
- Use MMR retrieval rather than bare similarity search (section 07).
- Treat chunk size and overlap as tunable parameters; re-evaluate when they change.
- Measure faithfulness and retrieval quality before and after every change (section 08).

// 10 Where to go from here

Past ~500k chunks, move to Pinecone, Weaviate, or pgvector. Try semantic chunking for long articles and papers. And wire RAGAS into CI so every pipeline change gets scored, not eyeballed.

✦ final word
The most important discipline is measurement. A RAG system without evals is just vibes engineering. Start measuring from day one.