
Build a RAG Chatbot with pgvector and Claude

AI Labs · June 4, 2026 · 5 min read

Key takeaways

A RAG chatbot retrieves document chunks from pgvector at query time and injects them into the prompt, keeping Claude's answers grounded in your source material.
Chunk size matters more than most students expect: 512-token chunks with 50-token overlap outperformed 1024-token chunks in 7 of 10 test queries during our June 2025 cohort review.
Claude Sonnet 4.6 handles 200K context, but large contexts inflate cost fast. Retrieving only the top-5 chunks keeps latency under 800ms and token spend manageable.
pgvector on a standard Postgres instance is sufficient for document collections under ~100K chunks; you don't need a dedicated vector DB for most classroom or small-production use cases.
LangChain 0.3's PGVector integration handles embedding storage and similarity search in roughly 15 lines of code, removing most of the boilerplate.

What does a RAG chatbot actually do under the hood?

A RAG chatbot answers questions by retrieving relevant text from your documents before it generates anything. That's the whole mechanism. You split documents into chunks, embed each chunk as a vector, store those vectors in a database, and at query time you embed the user's question, pull the closest chunks by cosine similarity, then hand everything to Claude and say "answer using only this."
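To make the "pull the closest chunks by cosine similarity" step concrete, here is a minimal pure-Python sketch. The three-dimensional vectors and chunk names are made up for illustration (real embeddings from text-embedding-3-small have 1536 dimensions), and in the actual build pgvector does this ranking inside Postgres rather than in application code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical toy embeddings, purely for illustration.
toy_chunks = {
    "optimizers": [0.9, 0.1, 0.0],
    "tokenizers": [0.0, 0.2, 0.9],
    "schedules": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

print(top_k(query, toy_chunks, k=2))
```

The "tokenizers" chunk points in a very different direction from the query, so it never makes the cut. That is the whole retrieval idea at small scale.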

Sounds simple. The devil is in the chunk size, the embedding model choice, and getting the retrieval step to surface the right passages instead of adjacent noise. We've run this project in week 8 of NLP & LLM Engineering for three cohorts now. Every single time, at least two groups ship a bot that confidently gives wrong answers because their chunks were too large or their similarity threshold too loose.

Here's the build we use, the mistakes we see every cohort, and the specific numbers that actually matter.

Why this stack is intentionally boring

Our week-8 stack: Postgres with the pgvector extension for vector storage, LangChain 0.3 for the plumbing, Claude Sonnet 4.6 for generation, and text-embedding-3-small from OpenAI for embeddings (1536 dimensions, cheap, accurate enough for document retrieval at this scale).

You don't need Pinecone. You don't need Weaviate. For document collections under roughly 100K chunks, pgvector inside a standard Postgres instance is fast, and it's one fewer service to break at 2am. Students in our last cohort deployed this as a single Cloud Run container talking to a Cloud SQL Postgres instance and shipped it on a Thursday afternoon. Done.

pip install anthropic langchain langchain-community langchain-openai \
            langchain-postgres "psycopg[binary]" pgvector pypdf tiktoken

CREATE EXTENSION IF NOT EXISTS vector;

That's the whole setup. One extension, one connection string, no managed vector service to configure.

How do chunk size and overlap actually affect retrieval?

Bad chunking destroys retrieval quality before you've written a single query. The default 1000-token chunk with zero overlap that LangChain's own docs show is fine for demos. It's not fine for dense technical PDFs, legal contracts, or anything where a sentence at the end of one chunk depends on context from the previous one.

In our June 2025 cohort review, we tested 512-token chunks with 50-token overlap against 1024-token chunks with no overlap across a 40-page ML textbook. Smaller chunks with overlap won on 7 of 10 test queries. The failure mode on large chunks is specific: the retrieved passage contains the right answer buried inside a lot of irrelevant surrounding text, and Claude either misses it or averages it away. Smaller chunks force the retrieval step to be precise.

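from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page.
loader = PyPDFLoader("your_document.pdf")
pages = loader.load()

# chunk_size counts characters by default, so use the tiktoken-based
# constructor to measure chunks in tokens. cl100k_base is the encoding
# used by text-embedding-3-small.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(pages)

# Reads OPENAI_API_KEY from the environment.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# langchain-postgres uses the psycopg (v3) driver, hence "+psycopg".
CONNECTION = "postgresql+psycopg://user:password@localhost:5432/ragdb"

# Embed every chunk and write the vectors to Postgres.
vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection=CONNECTION,
    collection_name="my_documents",
)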

Run that once per document. The vectors persist in Postgres. That's your pgvector RAG setup in 20 lines.
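If the effect of overlap feels abstract, a toy sliding-window chunker shows what the 50-token overlap buys you. This is a simplified sketch, not what RecursiveCharacterTextSplitter does internally: the real splitter also tries to break on the separators you give it, and the "tokens" here are placeholder strings rather than model tokens.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens standing in for a 1200-token document.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens)

print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary still shows up whole in at least one chunk.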

Wiring Claude into the retrieval chain

Once the vectors are in Postgres, the query side is almost anticlimactic. Embed the user's question, pull the top-5 most similar chunks, format them into a prompt, send it to Claude Sonnet 4.6 with a strict instruction to stay inside the provided context.

import anthropic
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONNECTION = "postgresql+psycopg://user:password@localhost:5432/ragdb"

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question using ONLY "
    "the context provided. If the answer is not in the context, say you "
    "don't know."
)

def ask(question: str, vectorstore: PGVector, k: int = 5) -> str:
    # Retrieve the k chunks closest to the question.
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(d.page_content for d in docs)

    response = client.messages.create(
        model="claude-sonnet-4-5",  # swap in the ID of the Claude model you use
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text

# Reconnect to the collection created during ingestion.
vectorstore = PGVector(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    connection=CONNECTION,
    collection_name="my_documents",
)

print(ask("What is the learning rate schedule described in chapter 3?", vectorstore))

The k=5 is your main dial. Above 8 retrieved chunks and you start adding noise. Below 3 and you risk the relevant passage not making it into the context at all. I've seen students set k=15 thinking more context is always better. It isn't. Claude's accuracy dropped noticeably on our test set once we crossed 8 chunks, probably because the signal-to-noise ratio in the prompt gets ugly.
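A second dial worth pairing with k is a distance cutoff, so that weak matches get dropped instead of padding the prompt. The sketch below assumes you already have (text, distance) pairs where a lower distance means a closer match, which is the kind of output a scored similarity search gives you. The 0.5 cutoff and the sample passages are hypothetical; tune the threshold against your own documents.

```python
def select_chunks(scored_chunks, k=5, max_distance=0.5):
    """Keep at most k chunks, dropping any whose distance exceeds the cutoff.

    scored_chunks: iterable of (text, distance) pairs, lower = more similar.
    max_distance: hypothetical threshold, not a recommended value.
    """
    kept = [(text, dist) for text, dist in scored_chunks if dist <= max_distance]
    kept.sort(key=lambda pair: pair[1])  # closest first
    return [text for text, _ in kept[:k]]

# Made-up retrieval results for illustration.
results = [
    ("Index page", 0.64),
    ("Cosine decay section", 0.22),
    ("Acknowledgements", 0.71),
    ("LR warmup section", 0.18),
]

print(select_chunks(results, k=3, max_distance=0.5))
```

With a cutoff in place, asking for k=3 returns only the two passages that are actually close, rather than filling the third slot with noise.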

Where students actually get stuck

The most common failure in our RAG & Vector DB lab isn't the code. It's evaluation. Students ship the bot, type a few questions, get plausible-looking answers, and declare victory. Then someone asks a question that requires combining two passages 15 pages apart in the document and the bot fails completely. Silently. With confidence.

The fix we push in every capstone review: write a small eval script before you show anyone the demo. Take 15-20 question-answer pairs where you already know the correct answers from your documents. Run them through the bot. Measure retrieval recall (does the right chunk even show up in the top-5?) separately from generation accuracy (does Claude's final answer match?). Those two numbers tell you immediately whether your problem lives in retrieval or generation. Skipping this step is the single most expensive shortcut we see.
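A minimal version of that eval script might look like the sketch below. It assumes each test case carries a question, an evidence snippet that should appear in a retrieved chunk, and an expected phrase that should appear in the final answer. The retrieve and answer callables are placeholders you would wire to your vector store and to Claude; the stubs here only exist so the script runs on its own. Substring matching is a crude scoring rule, chosen for brevity.

```python
def evaluate(eval_set, retrieve, answer, k=5):
    """Score retrieval recall and answer accuracy separately.

    retrieve(question, k) -> list of chunk texts
    answer(question, chunks) -> final answer string
    """
    if not eval_set:
        raise ValueError("eval_set must contain at least one case")
    retrieval_hits = 0
    answer_hits = 0
    for case in eval_set:
        chunks = retrieve(case["question"], k)
        # Retrieval recall: did the evidence show up in the top-k at all?
        if any(case["evidence"].lower() in chunk.lower() for chunk in chunks):
            retrieval_hits += 1
        # Generation accuracy: does the final answer contain the expected phrase?
        reply = answer(case["question"], chunks)
        if case["expected"].lower() in reply.lower():
            answer_hits += 1
    total = len(eval_set)
    return {
        "retrieval_recall": retrieval_hits / total,
        "answer_accuracy": answer_hits / total,
    }

# Hypothetical test cases and stubs, for illustration only.
eval_set = [
    {"question": "What warmup length is used?",
     "evidence": "warmup of 500 steps", "expected": "500 steps"},
    {"question": "Which optimizer is used?",
     "evidence": "trained with AdamW", "expected": "AdamW"},
]

def stub_retrieve(question, k):
    # Always returns the same passage, so only the warmup question is covered.
    return ["The schedule uses a warmup of 500 steps before decay."][:k]

def stub_answer(question, chunks):
    return "The warmup lasts 500 steps." if "warmup" in question else "I don't know."

print(evaluate(eval_set, stub_retrieve, stub_answer))
```

If retrieval recall is low, fix chunking or k before touching the prompt. If recall is high but accuracy is low, the problem is on the generation side.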

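We also see students panic when pgvector query latency climbs above 200ms on a larger collection. Build the index after the data is loaded, and note that pgvector can only index a column with a declared dimension: if the statement below fails with "column does not have dimensions", recreate the store with embedding_length=1536 passed to PGVector. With that in place, the fix is almost always just an IVFFlat index: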

CREATE INDEX ON langchain_pg_embedding 
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

That drops query time under 30ms on collections up to 500K vectors in our testing. One index, problem solved.
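The lists = 100 value is a reasonable fit for a collection of around 100K rows. As a starting point for other sizes, pgvector's documentation suggests rows / 1000 lists for up to 1M rows and sqrt(rows) beyond that, with roughly sqrt(lists) probes at query time. The helper below is a small sketch of that rule of thumb. The function names are our own, and the defaults simply mirror the table and column in the statement above.

```python
import math

def ivfflat_settings(row_count):
    """Starting values for IVFFlat `lists` and `probes` given a row count."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = int(math.sqrt(row_count))
    probes = max(1, int(math.sqrt(lists)))
    return lists, probes

def ivfflat_sql(row_count, table="langchain_pg_embedding", column="embedding"):
    """Build the CREATE INDEX and SET statements for the computed settings."""
    lists, probes = ivfflat_settings(row_count)
    create = (
        f"CREATE INDEX ON {table} "
        f"USING ivfflat ({column} vector_cosine_ops) WITH (lists = {lists});"
    )
    tune = f"SET ivfflat.probes = {probes};"
    return create, tune

for rows in (100_000, 500_000, 4_000_000):
    print(rows, ivfflat_settings(rows))

create_stmt, tune_stmt = ivfflat_sql(100_000)
print(create_stmt)
print(tune_stmt)
```

More probes improve recall at the cost of speed, so treat these numbers as a first guess and measure on your own queries.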

The full build, including evaluation frameworks, hybrid search with BM25, and a real discussion of when you should abandon pgvector for something like Weaviate, is part of NLP & LLM Engineering.

One thing worth sitting with after you get your RAG chatbot working on PDFs: what happens when the documents update? Most people don't think about that until they're six months into production and the bot is confidently answering questions from a document version that's eight months stale.

Frequently asked questions

What is a RAG chatbot and how does it differ from a regular chatbot?

A RAG chatbot retrieves relevant chunks of text from an external knowledge base before generating a response, so answers are grounded in specific documents rather than only the model's training data. A standard chatbot has no access to your documents and will hallucinate details it doesn't know.

Why use pgvector instead of a dedicated vector database like Pinecone or Weaviate?

For collections under roughly 100K chunks, pgvector running inside an existing Postgres instance is fast enough and far simpler to operate. You skip an extra service, an extra bill, and an extra failure point. We switch students to dedicated vector DBs only when the dataset grows large or the query throughput demands it.

Which Claude model works best for a PDF question answering AI?

Claude Sonnet 4.6 is our current default for this project. It's fast, cost-effective at classroom scale, and handles nuanced document reasoning well. Claude Opus 4 is worth testing if precision on dense technical documents is the priority, but the latency and cost jump is real.

How many chunks should I retrieve for each query in a RAG chatbot?

Top-5 is a good starting point. Fewer than 3 and you risk missing the relevant passage; more than 8 and you're padding the context with noise that confuses the model. Tune this with actual test queries against your specific document collection.
