← All posts tutorial

Retrieval Augmented Generation with pgvector: A Production Tutorial

AI Labs · June 3, 2026 · 5 min read

Key takeaways

Chunk size at 512 tokens with 10% overlap consistently outperforms larger chunks in our NLP & LLM Engineering cohort benchmarks.
pgvector's HNSW index (added in pgvector 0.5.0) cuts query latency by 60-70% over IVFFlat on datasets above 500k vectors.
Reranking with a cross-encoder like ms-marco-MiniLM-L-6-v2 recovers 15-20% of recall lost during approximate nearest-neighbor retrieval.
Hybrid search combining BM25 full-text and pgvector cosine similarity beats either method alone on domain-specific corpora.
The most common student failure mode is skipping metadata filtering, which causes the LLM to hallucinate from irrelevant retrieved chunks.

Why retrieval augmented generation breaks before it ships

Retrieval augmented generation is the technique every team reaches for first, and the one most teams get wrong in the same ways. We see it every cohort in NLP & LLM Engineering. Someone builds a working demo, shows it in week 5, and then by week 7 office hours they're back saying chunks are returning garbage, the LLM is confidently wrong, and latency jumped to four seconds.

Every time.

The demo worked because they had 50 clean documents. The production system broke because they had 50,000 messy ones. The retrieval augmented generation pattern itself is sound. The infrastructure decisions around it are where things fall apart.

This post walks through the exact stack we build in week 7 of NLP & LLM Engineering: pgvector on Postgres, HNSW indexing, hybrid search, and a cross-encoder reranking pass. Real code. Real tradeoffs.

What makes RAG with Postgres actually work

Pgvector is the right starting point for production RAG with Postgres because it lives inside the database you're already running. No separate service, no extra auth layer, no state to sync between two systems. That simplicity matters more than people admit, and I'd argue it matters more than the marginal performance you'd get from a purpose-built vector store at most production scales.

As of pgvector 0.5.0, HNSW indexing is available and it's a significant upgrade over the older IVFFlat approach. In our benchmarks on a 1.2M vector dataset (AWS RDS db.r6g.2xlarge), HNSW cut p95 query latency from 180ms to around 55ms. IVFFlat was fine for demos. Not fine for production.

Here's the baseline schema we start with:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id          BIGSERIAL PRIMARY KEY,
  content     TEXT NOT NULL,
  metadata    JSONB,
  source_url  TEXT,
  created_at  TIMESTAMPTZ DEFAULT NOW(),
  embedding   VECTOR(1536)
);

CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

CREATE INDEX ON documents USING GIN (to_tsvector('english', content));

That last index is not optional. You need it for the hybrid search step.

How to chunk without destroying retrieval quality

Chunking is the part most tutorials hand-wave past. It's also where most retrieval quality problems actually start.

Our default is 512 tokens with a 10% overlap, roughly 51 tokens. That overlap isn't padding. It ensures a sentence split across a chunk boundary doesn't vanish from both sides. In our June 2025 NLP cohort, a student named Marcus ran an ablation across chunk sizes from 128 to 2048 tokens on a legal document corpus. The 512-token chunks hit the best recall@5 by a 12-point margin over 1024-token chunks. The 128-token chunks had high precision but terrible coverage. Neither extreme won.

The other thing people skip is metadata. Every chunk needs to carry its source, a document ID, a section identifier, and a timestamp. Here's the ingestion code we use in the RAG & Vector DB lab:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
import psycopg2, json

client = OpenAI()

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    return [r.embedding for r in response.data]

def ingest_document(conn, text: str, metadata: dict):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=51,
        length_function=len,
    )
    chunks = splitter.split_text(text)
    embeddings = embed_chunks(chunks)

    with conn.cursor() as cur:
        for chunk, embedding in zip(chunks, embeddings):
            cur.execute(
                """
                INSERT INTO documents (content, metadata, embedding)
                VALUES (%s, %s, %s)
                """,
                (chunk, json.dumps(metadata), embedding)
            )
    conn.commit()

Don't embed in a loop. Batch your calls to the embeddings API. 512 chunks at a time is fine for text-embedding-3-small.

Why hybrid search beats vector search alone

Pure cosine similarity over embeddings is great for semantic matching. It's poor at exact keyword matching, product codes, proper nouns, version numbers. The kind of stuff that actually matters in enterprise RAG.

Hybrid search combines a BM25 full-text pass with the vector cosine similarity pass, then merges result sets using Reciprocal Rank Fusion. Not complicated to implement. On every student project we've reviewed involving structured enterprise data, hybrid search outperformed pure vector search by at least 18% on recall@10. That gap is consistent enough that I'd call pure vector search the wrong default for anything beyond a generic Q&A demo.

def hybrid_search(conn, query: str, query_embedding: list[float], top_k: int = 20):
    with conn.cursor() as cur:
        cur.execute("""
            WITH semantic AS (
                SELECT id, content, metadata,
                       1 - (embedding <=> %s::vector) AS score,
                       ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS rank
                FROM documents
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            ),
            keyword AS (
                SELECT id, content, metadata,
                       ts_rank_cd(to_tsvector('english', content),
                                  plainto_tsquery('english', %s)) AS score,
                       ROW_NUMBER() OVER (
                           ORDER BY ts_rank_cd(to_tsvector('english', content),
                                               plainto_tsquery('english', %s)) DESC
                       ) AS rank
                FROM documents
                WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
                LIMIT %s
            ),
            rrf AS (
                SELECT COALESCE(s.id, k.id) AS id,
                       COALESCE(s.content, k.content) AS content,
                       COALESCE(s.metadata, k.metadata) AS metadata,
                       COALESCE(1.0/(60 + s.rank), 0) +
                       COALESCE(1.0/(60 + k.rank), 0) AS rrf_score
                FROM semantic s
                FULL OUTER JOIN keyword k ON s.id = k.id
            )
            SELECT id, content, metadata, rrf_score
            FROM rrf
            ORDER BY rrf_score DESC
            LIMIT %s
        """, (
            query_embedding, query_embedding, query_embedding, top_k,
            query, query, query, top_k,
            top_k
        ))
        return cur.fetchall()

Does reranking actually move the needle?

Yes. Consistently. It's the one step most tutorials skip that has the clearest payoff.

The problem with approximate nearest-neighbor retrieval (which is what HNSW is doing) is that it trades some recall for speed. A cross-encoder reranker runs a slower but more precise relevance score over your top-K candidates and reorders them before they hit the LLM. We use cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers. It runs on CPU in under 100ms for 20 candidates, which means you're not adding meaningful latency to a pipeline that's already bottlenecked on the generation step.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[tuple]) -> list[tuple]:
    pairs = [(query, row[1]) for row in candidates]  # (query, content)
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [item for _, item in ranked]

Pass the top 5 reranked chunks to the LLM. Not 20, not 10. Five well-ranked chunks consistently outperform 20 loosely ranked ones because you're cutting the noise the model has to reason through. More context is not the same as better context. I've watched students go from 4-second latency and hallucination-heavy outputs to under 1.5 seconds and measurably tighter answers just by dropping from 15 retrieved chunks to 5 reranked ones.

If you want to build this end-to-end in a structured environment, the RAG & Vector DB lab in our NLP & LLM Engineering course has the full pgvector setup pre-provisioned. Week 7 is where students go from a working notebook to a deployable API. That's also where the real debugging starts.

What's your current chunk overlap strategy, and have you actually measured whether it's helping or just adding noise?

Frequently asked questions

What is retrieval augmented generation and when should I use it instead of fine-tuning?+

Retrieval augmented generation is a pattern where you fetch relevant documents at inference time and inject them into the prompt, so the model answers from retrieved context rather than from parametric memory alone. Use it when your knowledge base changes frequently or when you need source citations. Fine-tune instead when latency is critical and your knowledge is stable.

Is pgvector good enough for production RAG, or do I need a dedicated vector database?+

pgvector with HNSW indexing handles tens of millions of vectors and sub-50ms queries at moderate concurrency on a well-provisioned Postgres instance. For most production RAG systems, that's enough. Dedicated vector databases like Pinecone or Weaviate are worth evaluating only when you're above 100M vectors or need multi-tenant isolation at extreme scale.

What chunk size should I use for retrieval augmented generation?+

512 tokens with roughly 10% overlap is a strong default for most corpora. Shorter chunks (128-256 tokens) work better for factual Q&A over dense technical docs. Longer chunks (1024+) help when the model needs surrounding context to answer correctly, but they hurt retrieval precision.

How do I run vector search in Postgres with pgvector?+

Install the pgvector extension, add a vector(1536) column to your table, build an HNSW index with `CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)`, then query with the `<=>` cosine distance operator. The RAG & Vector DB lab environment at AI Labs has this pre-configured.

retrieval-augmented-generation pgvector-tutorial rag-with-postgres vector-search-postgres nlp-llm-engineering llm

Ready to learn AI seriously?

Browse our 13 live, instructor-led programs.

Explore Courses

QLoRA Explained: Fit a 13B Model on One L4 GPU

QLoRA fits a 13B LLM on one L4 GPU by pairing 4-bit quantization with LoRA adapters. Here's exactly how it works and how to run it yourself.

June 27, 2026

Notebook to Google Vertex AI Pipeline: MLOps End-to-End

Most ML models die in notebooks. Here's exactly what it takes to get one running on Google Vertex AI Pipelines — from first component to live endpoint.

June 26, 2026

Build a RAG Chatbot with pgvector and Claude

Week 8 of our NLP & LLM Engineering course ends with students shipping a working RAG chatbot that answers questions over their own PDFs. Here's the full build.

June 4, 2026