This tutorial shows you how to build a production-ready retrieval-augmented chatbot in 2026, one that actually survives contact with real users, using LangChain, ChromaDB, and patterns that work at scale.
Let's be direct: most RAG tutorials online show you how to ask a question about a single PDF. That's not RAG in production. That's a weekend demo.
Retrieval-Augmented Generation is the pattern of giving an LLM access to an external knowledge base at query time. Instead of relying purely on training-time knowledge, you fetch relevant context dynamically and include it in the prompt. The model answers based on what you gave it, not just what it memorized.
Why does this actually matter in production? Because your data changes faster than any model's training cutoff, because users ask about private documents the model has never seen, and because grounding answers in retrieved text sharply reduces hallucination.
Before touching any code, here's the complete pipeline in two phases. Ingestion (offline): load documents, chunk them, embed each chunk, and persist the vectors. Query (online): embed the user's question, retrieve the top-k relevant chunks, stuff them into the prompt, and generate. The key insight: ingestion and querying are separate concerns. Keep them that way in your codebase.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install everything we need
pip install langchain langchain-openai langchain-community \
    chromadb openai tiktoken \
    pypdf sentence-transformers \
    python-dotenv fastapi uvicorn
```
```bash
# .env
OPENAI_API_KEY=sk-...
CHROMA_PERSIST_DIR=./chroma_db
EMBED_MODEL=text-embedding-3-small
LLM_MODEL=gpt-4o-mini
```
text-embedding-3-small is the sweet spot for embeddings in 2026: cheap, fast, and accurate. For the LLM, gpt-4o-mini offers a great quality-to-cost ratio. Swap in Claude or Gemini if you prefer; LangChain abstracts the backend cleanly.
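To see what "abstracts the backend cleanly" means in practice, here's the swap as a sketch. The Anthropic line assumes you've also run `pip install langchain-anthropic`, which is not in the install list above:

```python
from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic  # requires langchain-anthropic

# Everything downstream takes any LangChain chat model; swapping providers
# is a one-line change.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
# llm = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0.1)
```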
```python
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

load_dotenv()

def build_vector_store(docs_dir="./data/docs"):
    # Load every PDF under docs_dir, recursively
    loader = DirectoryLoader(docs_dir, glob="**/*.pdf", loader_cls=PyPDFLoader)
    docs = loader.load()

    # Chunk: 800 characters with 120-character (~15%) overlap
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
    chunks = splitter.split_documents(docs)

    # Embed each chunk and persist the index to disk
    embeddings = OpenAIEmbeddings(model=os.getenv("EMBED_MODEL", "text-embedding-3-small"))
    vectordb = Chroma.from_documents(
        chunks,
        embeddings,
        persist_directory=os.getenv("CHROMA_PERSIST_DIR", "./chroma_db"),
    )
    return vectordb
```
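Ingestion runs as a one-off script, not per request. A minimal entry point (the `__main__` guard and path here are illustrative):

```python
# Build (or rebuild) the index; re-run whenever source documents change.
if __name__ == "__main__":
    build_vector_store("./data/docs")
    print("Index built and persisted to ./chroma_db")
```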
This is where most tutorials cut corners, and where most production RAG pipelines silently fail. Chunking isn't just "split every N characters." How you split determines whether retrieval returns useful context or incoherent garbage.
A RAG pipeline is only as good as the chunks it retrieves. Bad chunking means the right information is split across boundaries the model can never bridge.
| Strategy | Best For | Gotcha |
|---|---|---|
| Fixed-size | Uniform data, fast indexing | Breaks mid-sentence constantly |
| Recursive character | General purpose | Not structure-aware |
| Semantic | Long articles, papers | Slower; embeds at split time |
For most production use cases, recursive character splitting with 800 chars and ~15% overlap is where to start.
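To make "recursive" concrete: the splitter walks a separator list in order, so chunks break at paragraph or sentence boundaries whenever possible and only cut mid-word as a last resort. A short sketch (the separator list is spelled out for illustration; the sample text is made up):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,  # ~15% overlap so facts near a boundary land in both chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)

sample = ("Refunds are accepted within 30 days. Items must be unused.\n\n"
          "Shipping takes 3-5 business days. Tracking is emailed on dispatch. ") * 20
for chunk in splitter.split_text(sample):
    print(len(chunk), repr(chunk[:50]))
```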
An embedding is a high-dimensional vector representing the semantic meaning of text. Two chunks about the same concept should have vectors pointing in roughly the same direction, even if they share no words in common.
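You can verify that intuition in a few lines using the same embedding model as the pipeline (the example sentences are made up; requires OPENAI_API_KEY in your environment):

```python
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")
v1, v2, v3 = emb.embed_documents([
    "How do I reset my password?",
    "Steps to recover account credentials",
    "Our office dog is named Biscuit",
])

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

print(cosine(v1, v2))  # high: same concept, almost no shared words
print(cosine(v1, v3))  # low: unrelated
```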
ChromaDB is our vector store. It runs in-process, persists to disk, and has a clean Python API. Right call for most teams building their first production RAG. When you scale past ~500k chunks, look at Pinecone, Weaviate, or pgvector.
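Before wiring up the full chain, sanity-check the persisted store directly. A sketch, assuming the ingestion step above has already run:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Re-open the persisted index from disk
vectordb = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

# The score here is a distance: lower means more similar
for doc, score in vectordb.similarity_search_with_score("refund policy", k=3):
    print(round(score, 3), doc.metadata.get("source"), doc.page_content[:80])
```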
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

def build_rag_chain(persist_dir="./chroma_db"):
    # Re-open the persisted index; the query path must not depend on ingestion code
    vectordb = Chroma(
        persist_directory=persist_dir,
        embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    )

    # MMR retrieval: fetch 20 candidates, keep the 5 most relevant AND diverse
    retriever = vectordb.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20},
    )

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using ONLY the context below.\n\n{context}"),
        ("human", "{input}"),
    ])

    # "Stuff" all retrieved chunks into the prompt, then generate
    qa_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, qa_chain)
```
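Using the chain is one call. `create_retrieval_chain` returns both the answer and the retrieved documents, so you can show sources to users (the example question is made up):

```python
chain = build_rag_chain()
result = chain.invoke({"input": "What is our refund policy?"})

print(result["answer"])
for doc in result["context"]:  # the chunks the answer was grounded in
    print("source:", doc.metadata.get("source"))
```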
Two metrics that matter most in practice: retrieval quality (did the retriever actually surface the chunks containing the answer?) and faithfulness (did the model stick to the retrieved context instead of hallucinating?).
For a rigorous setup, use RAGAS, a Python library for evaluating faithfulness, answer relevancy, and context recall using LLM-based judges.
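A minimal RAGAS sketch, assuming `pip install ragas datasets` (neither is in the install list above); the sample row is made up, the API shown matches the ragas 0.1.x line, and a real eval set needs dozens of question/ground-truth pairs:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One hand-built row: question, model answer, retrieved contexts, gold answer
eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our refund policy allows returns within 30 days."]],
    "ground_truth": ["30 days"],
})

scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_recall])
print(scores)
```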
Where to go from here:

- Metadata filtering with Chroma `where` filters, to scope retrieval per user, tenant, or document set
- `ConversationBufferWindowMemory` for multi-turn chat, so follow-up questions keep recent context
- `EnsembleRetriever` for hybrid keyword + vector search (sketched below)
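As a sketch of that last item: combine BM25 keyword matching with the vector retriever. Assumes `pip install rank_bm25` plus the `chunks` and `vectordb` objects from the ingestion step above:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword side: BM25 over the same chunks that were embedded
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 5

# Blend keyword and vector scores; weights are a starting point to tune per corpus
hybrid = EnsembleRetriever(
    retrievers=[bm25, vectordb.as_retriever(search_kwargs={"k": 5})],
    weights=[0.4, 0.6],
)

docs = hybrid.invoke("refund policy")
```

Hybrid search shines on queries full of exact identifiers (SKUs, error codes, names) where pure vector similarity tends to blur the distinctions that matter.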