Why retrieval augmented generation breaks before it ships
Retrieval augmented generation is the technique every team reaches for first, and the one most teams get wrong in the same ways. We see it every cohort in NLP & LLM Engineering. Someone builds a working demo, shows it in week 5, and then by week 7 office hours they're back saying chunks are returning garbage, the LLM is confidently wrong, and latency jumped to four seconds.
Every time.
The demo worked because they had 50 clean documents. The production system broke because they had 50,000 messy ones. The retrieval augmented generation pattern itself is sound. The infrastructure decisions around it are where things fall apart.
This post walks through the exact stack we build in week 7 of NLP & LLM Engineering: pgvector on Postgres, HNSW indexing, hybrid search, and a cross-encoder reranking pass. Real code. Real tradeoffs.
What makes RAG with Postgres actually work
Pgvector is the right starting point for production RAG with Postgres because it lives inside the database you're already running. No separate service, no extra auth layer, no state to sync between two systems. That simplicity matters more than people admit, and I'd argue it matters more than the marginal performance you'd get from a purpose-built vector store at most production scales.
As of pgvector 0.5.0, HNSW indexing is available and it's a significant upgrade over the older IVFFlat approach. In our benchmarks on a 1.2M vector dataset (AWS RDS db.r6g.2xlarge), HNSW cut p95 query latency from 180ms to around 55ms. IVFFlat was fine for demos. Not fine for production.
Here's the baseline schema we start with:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB,
source_url TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
embedding VECTOR(1536)
);
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX ON documents USING GIN (to_tsvector('english', content));
That last index is not optional. You need it for the hybrid search step.
How to chunk without destroying retrieval quality
Chunking is the part most tutorials hand-wave past. It's also where most retrieval quality problems actually start.
Our default is 512 tokens with a 10% overlap, roughly 51 tokens. That overlap isn't padding. It ensures a sentence split across a chunk boundary doesn't vanish from both sides. In our June 2025 NLP cohort, a student named Marcus ran an ablation across chunk sizes from 128 to 2048 tokens on a legal document corpus. The 512-token chunks hit the best recall@5 by a 12-point margin over 1024-token chunks. The 128-token chunks had high precision but terrible coverage. Neither extreme won.
The other thing people skip is metadata. Every chunk needs to carry its source, a document ID, a section identifier, and a timestamp. Here's the ingestion code we use in the RAG & Vector DB lab:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
import psycopg2, json
client = OpenAI()
def embed_chunks(chunks: list[str]) -> list[list[float]]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=chunks
)
return [r.embedding for r in response.data]
def ingest_document(conn, text: str, metadata: dict):
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=51,
length_function=len,
)
chunks = splitter.split_text(text)
embeddings = embed_chunks(chunks)
with conn.cursor() as cur:
for chunk, embedding in zip(chunks, embeddings):
cur.execute(
"""
INSERT INTO documents (content, metadata, embedding)
VALUES (%s, %s, %s)
""",
(chunk, json.dumps(metadata), embedding)
)
conn.commit()
Don't embed in a loop. Batch your calls to the embeddings API. 512 chunks at a time is fine for text-embedding-3-small.
Why hybrid search beats vector search alone
Pure cosine similarity over embeddings is great for semantic matching. It's poor at exact keyword matching, product codes, proper nouns, version numbers. The kind of stuff that actually matters in enterprise RAG.
Hybrid search combines a BM25 full-text pass with the vector cosine similarity pass, then merges result sets using Reciprocal Rank Fusion. Not complicated to implement. On every student project we've reviewed involving structured enterprise data, hybrid search outperformed pure vector search by at least 18% on recall@10. That gap is consistent enough that I'd call pure vector search the wrong default for anything beyond a generic Q&A demo.
def hybrid_search(conn, query: str, query_embedding: list[float], top_k: int = 20):
with conn.cursor() as cur:
cur.execute("""
WITH semantic AS (
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS score,
ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS rank
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
),
keyword AS (
SELECT id, content, metadata,
ts_rank_cd(to_tsvector('english', content),
plainto_tsquery('english', %s)) AS score,
ROW_NUMBER() OVER (
ORDER BY ts_rank_cd(to_tsvector('english', content),
plainto_tsquery('english', %s)) DESC
) AS rank
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
LIMIT %s
),
rrf AS (
SELECT COALESCE(s.id, k.id) AS id,
COALESCE(s.content, k.content) AS content,
COALESCE(s.metadata, k.metadata) AS metadata,
COALESCE(1.0/(60 + s.rank), 0) +
COALESCE(1.0/(60 + k.rank), 0) AS rrf_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
)
SELECT id, content, metadata, rrf_score
FROM rrf
ORDER BY rrf_score DESC
LIMIT %s
""", (
query_embedding, query_embedding, query_embedding, top_k,
query, query, query, top_k,
top_k
))
return cur.fetchall()
Does reranking actually move the needle?
Yes. Consistently. It's the one step most tutorials skip that has the clearest payoff.
The problem with approximate nearest-neighbor retrieval (which is what HNSW is doing) is that it trades some recall for speed. A cross-encoder reranker runs a slower but more precise relevance score over your top-K candidates and reorders them before they hit the LLM. We use cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers. It runs on CPU in under 100ms for 20 candidates, which means you're not adding meaningful latency to a pipeline that's already bottlenecked on the generation step.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank(query: str, candidates: list[tuple]) -> list[tuple]:
pairs = [(query, row[1]) for row in candidates] # (query, content)
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
return [item for _, item in ranked]
Pass the top 5 reranked chunks to the LLM. Not 20, not 10. Five well-ranked chunks consistently outperform 20 loosely ranked ones because you're cutting the noise the model has to reason through. More context is not the same as better context. I've watched students go from 4-second latency and hallucination-heavy outputs to under 1.5 seconds and measurably tighter answers just by dropping from 15 retrieved chunks to 5 reranked ones.
If you want to build this end-to-end in a structured environment, the RAG & Vector DB lab in our NLP & LLM Engineering course has the full pgvector setup pre-provisioned. Week 7 is where students go from a working notebook to a deployable API. That's also where the real debugging starts.
What's your current chunk overlap strategy, and have you actually measured whether it's helping or just adding noise?