RAG vs Fine-Tuning: Why This Question Comes Up Every Week
The RAG vs fine-tuning debate is the single most common question in our office hours. Not "how do I train a model" or "what's a transformer." This specific question, every week, from students who are four weeks into a real project and suddenly realize they've committed to an approach without a framework for choosing it.
Last month I counted: six separate students in one Tuesday session asked a version of the same thing. They'd read blog posts. Seen benchmarks. None of it gave them a number they could act on.
This post is that number. Several of them, actually, pulled from work we've done in our LLM Fine-Tuning (A100) lab and from watching capstone projects succeed and fail in our NLP & LLM Engineering course.
Why Prompting Alone Handles More Than You Think
Prompt engineering solves the problem before you need RAG or fine-tuning in roughly 70% of student capstone projects. Not a soft estimate. I started tracking it after a student in our March 2025 cohort spent three weeks building a RAG pipeline for a customer support bot, then discovered that a well-structured system prompt with four examples got him to 91% of the same answer quality at zero retrieval cost.
Prompt engineering vs fine-tuning is almost always the first comparison you should make, not the last. The mistake is treating prompting as the "basic" option you graduate out of. Claude Sonnet 4.6 and GPT-4o with a strong system prompt and two or three well-chosen few-shot examples are genuinely powerful. If you haven't spent at least a full day on prompt iteration, you're not ready to talk about fine-tuning.
Where prompting breaks down: tasks that require consistent structured output at scale, domain-specific reasoning the base model simply doesn't have (think proprietary financial instruments or rare clinical terminology), and situations where you're paying per-token on a long context that could be baked into weights instead.
When to Fine-Tune an LLM: The Numbers That Actually Matter
Fine-tuning earns its place when latency is a hard constraint, when your output format needs to be mechanically consistent across thousands of calls, or when the knowledge is static and query volume is high enough that retrieval overhead accumulates.
Here's what we've measured in the LLM Fine-Tuning (A100) lab. A fine-tuned Llama 3.1 70B, quantized to 4-bit with QLoRA, responds in under 180ms on an A100 80GB for most prompts under 512 tokens. A comparable RAG pipeline (pgvector retrieval on a 100k-document corpus, top-5 chunks, then generation) runs 300 to 800ms depending on chunk size and reranker configuration. That gap matters if you're building anything user-facing.
Training cost for a 10,000-sample supervised fine-tuning job on Llama 3.1 70B runs about $12-18 per run in our lab. You'll likely run three to eight iterations before you're satisfied with the result. Call it $60-120 total. A managed pgvector instance on a cloud provider, with the ingestion pipeline and embedding API calls factored in, often costs more than that per month at moderate query volume.
The crossover point, roughly: if your knowledge base is static and you expect more than a few thousand queries per month, fine-tuning can be cheaper over a six-month horizon. If your data changes weekly, that calculation flips hard.
How Should You Actually Choose?
I'd walk through this as a sequence of honest questions, not a flowchart. I've seen too many flowcharts lead people astray.
Have you maxed out prompting? Two hours minimum of systematic prompt iteration with few-shot examples. If you haven't, go do that first.
How often does your knowledge change? If the answer is "daily" or "weekly", RAG is almost certainly easier to maintain. Retrieval pipelines let you update your vector store without retraining. Fine-tuning a new model every time your product catalog changes is operationally painful.
What does your latency budget look like? Sub-200ms is hard to hit with RAG unless you have aggressively optimized retrieval. Fine-tuning wins on latency, consistently.
Do you need format consistency or specialized reasoning? A fine-tuned model that always returns valid JSON in your exact schema is much easier to integrate than a prompted model that does it 94% of the time. That 6% gap causes production incidents.
RAG vs fine-tuning as a binary is also a bit of a false frame. You can fine-tune the generator inside a RAG pipeline, and sometimes that's the right answer. But that combination adds two failure surfaces. We've seen students in our MLOps & AI Infrastructure course spend weeks debugging systems where they couldn't tell if a bad output was a retrieval failure or a model failure. Build each piece in isolation first.
The Hidden Cost Nobody Talks About
The real cost in LLM customization isn't compute. It's iteration time and maintenance.
Fine-tuning requires a clean, labeled dataset. Building that for a real domain usually takes longer than the training run itself. RAG requires a retrieval pipeline you actually trust: chunking strategy, embedding model choice (we've had good results with text-embedding-3-large and bge-large-en-v1.5), reranking, and a vector store that doesn't fall over under load.
Neither option is free. The question is which debt you'd rather carry given your specific constraints.
For teams moving fast with changing requirements: RAG, and use our RAG & Vector DB lab to prototype before committing to infrastructure. For teams with stable data and strict latency or format requirements: fine-tune, and expect to spend as much time on your dataset as on your training run.
The students who make this choice well are the ones who've tried both in a real environment. The ones who get stuck made the call based on a blog post. Possibly this one. So go run the experiment.