Figure: the RAG pipeline — documents shredded into discrete data blocks, mapped through a vector grid, and reassembled into a structured AI response.

If you ask ChatGPT for the current employee leave policy of your specific company, it will hallucinate wildly. It will confidently tell you that you get unlimited PTO based on generic training data it absorbed from a random SaaS startup's blog in 2022.

Large Language Models are incredible reasoning engines, but they are terrible databases. If you want an LLM to answer questions about proprietary data safely, you must explicitly pass that data into its context window. When you have a 3-page PDF, you can just copy-paste it into the prompt. But when you have a 400-page corporate manual, it exceeds the context limit, and even if it didn't, pasting 400 pages would cost you $5 per question in API fees.

The solution is Retrieval-Augmented Generation (RAG). This is the cornerstone of enterprise AI Engineering. We are going to build one right now using Python and LangChain.

The RAG Blueprint

RAG operates in three distinct phases:

  1. Chunking: We slice the massive document into small, overlapping chunks — in this tutorial, roughly 200 words (1,000 characters) each.
  2. Embedding: We use an AI model to turn each chunk into a long array of numbers (a vector) — 1,536 of them for the model we'll use — representing the semantic meaning of that chunk, and store it in a database.
  3. Retrieval: When a user asks a question, we turn the question into a vector, find the 3 chunks whose vectors are mathematically closest to the question's vector, and send only those 3 chunks to the LLM to get an answer.
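
Under the hood, "mathematically closest" usually means cosine similarity. Here's a toy sketch with made-up 3-dimensional vectors — real embeddings have 1,536 dimensions, and both the chunk texts and the numbers below are invented purely for illustration:

```python
from math import sqrt

def cosine_similarity(a, b):
    """1.0 = pointing in the same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy "embeddings" — invented values, 3 dimensions instead of 1,536
chunks = {
    "PTO policy: employees accrue 1.5 days per month":   [0.9, 0.1, 0.2],
    "Server room access requires a keycard":             [0.1, 0.8, 0.3],
    "Leave requests must be filed two weeks in advance": [0.8, 0.2, 0.1],
}
question_vector = [0.85, 0.15, 0.15]  # pretend embedding of "How much PTO do I get?"

# Rank every chunk against the question and keep the closest two
ranked = sorted(chunks, key=lambda c: cosine_similarity(chunks[c], question_vector),
                reverse=True)
print(ranked[:2])  # the two leave-related chunks win; the keycard one loses
```

This ranking is exactly what the vector database does for us at scale, with optimized index structures instead of a brute-force sort.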

Step 1: The Setup

We are going to use the OpenAI API for embeddings and text generation, and ChromaDB (a fast, local vector database) to store the data.

pip install langchain langchain-community langchain-openai chromadb python-dotenv tiktoken

Ensure you have your .env file set up with your OpenAI key, exactly as we covered in the ChatGPT API Python tutorial.

Step 2: Ingesting and Chunking Data

Create a file named rag_bot.py. First, we need data. Let's pretend we have a long text file called company_wiki.txt sitting in our directory. We're going to load it and slice it into chunks.

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

# 1. Load the huge document
loader = TextLoader("company_wiki.txt")
document = loader.load()

# 2. Slice it up
# A chunk size of 1000 characters is roughly 150-200 words.
# The overlap means neighboring chunks share text, so a sentence cut
# at a chunk boundary still appears whole in one of them.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(document)
print(f"Split document into {len(chunks)} searchable chunks.")
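
RecursiveCharacterTextSplitter prefers to break at paragraph and sentence boundaries, but the arithmetic behind chunk_size and chunk_overlap can be sketched with a naive sliding window. This toy splitter is not the LangChain implementation — it's just here to make the overlap concrete:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Naive sliding window: each chunk starts (chunk_size - overlap)
    characters after the previous one, so neighbors share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Stand-in for the wiki contents: 2,500 characters of digits
wiki = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(wiki)

print(len(pieces))                          # -> 4
print(pieces[0][-200:] == pieces[1][:200])  # -> True (the shared overlap)
```

Notice the last chunk is a leftover stub — one of several edge cases the real splitter handles more gracefully.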

Step 3: Storing Vectors in ChromaDB

Now we pass those chunks through OpenAI's embedding model (text-embedding-3-small). The model looks at each chunk, calculates its semantic meaning, and assigns it a 1,536-dimensional coordinate. ChromaDB stores those coordinates locally.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize the embedding model
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a local vector database in memory
# Behind the scenes, this sends our text to OpenAI, gets the vectors, and stores them in Chroma.
vector_db = Chroma.from_documents(
    documents=chunks, 
    embedding=embeddings_model
)

# Set the database to act as a "retriever" that brings back the top 3 most relevant chunks
retriever = vector_db.as_retriever(search_kwargs={"k": 3})

Step 4: The System Prompt (Anti-Hallucination)

This is where Prompt Engineering meets architecture. We must instruct the LLM to strictly operate within the confines of the retrieved chunks. If we don't, it will fall back on its pre-existing training data.

from langchain_core.prompts import ChatPromptTemplate

system_template = """Use ONLY the following pieces of retrieved context to answer the user's question.
If the answer is not contained in the context, output exactly: "I cannot answer this based on the provided documents."
Do NOT try to make up an answer.

{context}
"""

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("user", "{question}")
])
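
It helps to see what the model will actually receive once {context} is filled in. Plain Python string formatting performs the same substitution — the two "retrieved" snippets below are made up for illustration:

```python
system_template = """Use ONLY the following pieces of retrieved context to answer the user's question.
If the answer is not contained in the context, output exactly: "I cannot answer this based on the provided documents."
Do NOT try to make up an answer.

{context}
"""

# Two made-up retrieved chunks, joined with blank lines as the chain will do
retrieved = [
    "Remote work is permitted on Fridays with manager approval.",
    "All remote days must be logged in the HR portal.",
]
system_message = system_template.format(context="\n\n".join(retrieved))
print(system_message)
```

The LLM never sees the 400-page manual — only this short system message plus the user's question.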

Step 5: Wiring the Chain Together

LangChain uses LCEL (LangChain Expression Language) to glue these components together seamlessly. We tell the chain: take the question, grab the context from the retriever, format the prompt, and send it to the LLM.

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Use a highly deterministic model for RAG extraction
# Temperature 0 is critical. See our parameter guide for why.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# The RAG Pipeline
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

# Execute the query
user_query = "What is the policy for remote work on Fridays?"
print(f"Question: {user_query}")

answer = rag_chain.invoke(user_query)
print(f"\nAnswer: {answer}")
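
If the | syntax feels magical, it isn't: LCEL pipes are essentially function composition. Here's the same data flow with stubbed components in plain Python — every function below is a made-up stand-in (operating on plain strings, not LangChain objects); only the wiring mirrors the real chain:

```python
def fake_retriever(question: str) -> list[str]:
    # Real step: embed the question, return the k nearest chunks
    return ["Remote work is allowed on Fridays for all full-time staff."]

def join_docs(docs: list[str]) -> str:
    # Real step: format_docs() joins the chunks with blank lines
    return "\n\n".join(docs)

def build_prompt(inputs: dict) -> str:
    # Real step: prompt_template fills in {context} and {question}
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"

def fake_llm(prompt: str) -> str:
    # Real step: the ChatOpenAI call; here we return a canned answer
    return "Remote work is allowed on Fridays."

def rag_chain(question: str) -> str:
    inputs = {"context": join_docs(fake_retriever(question)),
              "question": question}
    return fake_llm(build_prompt(inputs))

print(rag_chain("What is the policy for remote work on Fridays?"))
```

Each | in the LCEL chain is one of these hand-offs: the output of the left component becomes the input of the right one.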

Scaling to Production

What you just built is a functional, in-memory RAG pipeline. Run the file and it builds the vector database on the fly, then answers the query against your own document.

When you transition this to an enterprise environment, the code logic remains largely identical, but you swap out the components for scale. Instead of Chroma in memory, you push your vectors to a hosted Pinecone or AWS OpenSearch index. Instead of loading a text file, your loader pulls directly from the company's Confluence API via a cron job on a server.

You have just bridged the gap between a chat interface and private data. Welcome to real AI engineering.
