How to Fine-Tune an LLM on a Single A100: Full Walkthrough
Last month, a student in our NLP & LLM Engineering cohort asked me if fine-tuning a 7B model required a multi-GPU cluster. She'd seen a few tutorials that implied as much. The honest answer: no, not anymore. Fine-tuning an LLM on a single A100 80GB is completely realistic today, and we run through it every week inside our LLM Fine-Tuning (A100) lab environment.
This post covers exactly what we do. Real code, real VRAM numbers, and a few places where we've watched students get burned.
How Do You Fine-Tune an LLM Without Running Out of Memory?
QLoRA. That's the thing that made single-GPU fine-tuning practical for 7B models, and it works because two ideas reinforce each other: 4-bit NF4 quantization (via bitsandbytes) shrinks the frozen base model's memory footprint, while LoRA adapters (via PEFT) keep the trainable parameter count small enough to fit alongside it.
On a full bf16 fine-tune of Llama 3.1 8B, you'd need roughly 16GB just to hold the weights, and several times that once gradients and AdamW optimizer states are counted, all before activations. With QLoRA, the frozen backbone drops to around 5 to 6GB in 4-bit (the embeddings and output head stay in 16-bit), and you're training roughly 10 to 40 million adapter parameters, depending on rank and target modules, instead of 8 billion. Peak VRAM during our standard run sits around 38GB, on a sequence length of 2048 with a batch size of 4 and gradient accumulation of 8.
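If you want to sanity-check memory numbers like these yourself, the arithmetic is short. The sketch below is a back-of-the-envelope estimate, not a measurement: the parameter count is an approximation for an 8B Llama, the optimizer layout (two fp32 moments per weight) is one common AdamW setup, and the helper name is ours.

```python
def gigabytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for n_params values at the given width, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8.03e9  # approximate parameter count of an 8B Llama (assumption)

# Full fine-tune: every weight is trainable.
bf16_weights = gigabytes(N_PARAMS, 2)   # 16-bit weights
bf16_grads = gigabytes(N_PARAMS, 2)     # one 16-bit gradient per weight
adamw_states = gigabytes(N_PARAMS, 8)   # two fp32 moments per weight
full_finetune = bf16_weights + bf16_grads + adamw_states

# QLoRA: frozen weights at 4 bits each. This is a floor, since it ignores
# quantization constants and any layers that stay in 16-bit.
nf4_weights = gigabytes(N_PARAMS, 0.5)

print(f"bf16 weights only:        {bf16_weights:.1f} GB")
print(f"full fine-tune, no acts:  {full_finetune:.1f} GB")
print(f"4-bit frozen weights:     {nf4_weights:.1f} GB (lower bound)")
```

Activations come on top of all of these, and they scale with batch size and sequence length, which is why the measured peak is so much higher than the 4-bit weight footprint.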
One thing we tell every student before they touch the training loop: gradient checkpointing is not optional. Turn it off on a 7B model and you'll blow past 80GB before the second batch. Not at the end of training. The second batch.
The Environment Setup That Actually Works
Our LLM Fine-Tuning (A100) lab environment comes pre-loaded with CUDA 12.4, PyTorch 2.4, bitsandbytes 0.43, PEFT 0.11, and Transformers 4.44. That matters more than it sounds. Version mismatches between bitsandbytes and CUDA are the single biggest time sink we see in office hours. Students who spin up a generic Colab and assemble dependencies from scratch typically spend 45 minutes debugging imports before writing a single training line. I've watched it happen in at least four sessions this quarter.
If you're working outside our lab, here's the install that's been stable for us:
pip install transformers==4.44.0 peft==0.11.1 bitsandbytes==0.43.3 \
    trl==0.9.6 accelerate==0.33.0 datasets==2.20.0
Don't mix versions arbitrarily. The trl version matters specifically because we use SFTTrainer, and the API changed between 0.8 and 0.9 in ways that don't throw obvious errors; they just produce bad training runs.
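Since a wrong version can fail quietly, it's worth checking what is actually installed before you train. Here's a minimal sketch using only the standard library. The pin list mirrors the install command above, and check_pins is a helper name we made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above.
PINNED = {
    "transformers": "4.44.0",
    "peft": "0.11.1",
    "bitsandbytes": "0.43.3",
    "trl": "0.9.6",
    "accelerate": "0.33.0",
    "datasets": "2.20.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (wanted, found)} for every package that is off-pin.

    found is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Run it at the top of your notebook. An empty result means your environment matches the pins.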
The Core Fine-Tuning Code, Step by Step
Here's what our lab walkthrough actually runs. We use Llama 3.1 8B Instruct as the base, a small instruction-following dataset from Hugging Face, and a rank-16 LoRA config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA adapter config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this config (r=16, attention projections only), expect:
# trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

training_args = SFTConfig(
    output_dir="./llama31-qlora-out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    bf16=True,
    max_seq_length=2048,
    logging_steps=25,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    # Pass the tokenizer configured above; otherwise SFTTrainer loads a
    # fresh one and the padding settings are never used.
    tokenizer=tokenizer,
)
trainer.train()
Two things in here that students tend to skip past. We use paged_adamw_8bit instead of standard AdamW because it offloads optimizer states to CPU when GPU memory pressure spikes, saving roughly 6GB during the first few steps when VRAM allocation is least predictable. And bnb_4bit_use_double_quant=True quantizes the quantization constants themselves, which sounds recursive but shaves another 0.4 bits per parameter in practice. Small. Adds up.
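The double-quantization saving is easy to check by hand. This sketch assumes the block sizes described for QLoRA (64 weights per first-level block, 256 first-level constants per second-level block, constants stored as fp32 before and 8-bit after). Treat those values as assumptions and check them against your bitsandbytes version.

```python
BLOCK = 64          # weights sharing one first-level quantization constant
SUPER_BLOCK = 256   # first-level constants sharing one second-level constant

# Without double quantization: one fp32 constant per block of weights.
plain_overhead = 32 / BLOCK

# With double quantization: one 8-bit constant per block, plus one fp32
# constant per super-block of first-level constants.
double_overhead = 8 / BLOCK + 32 / (BLOCK * SUPER_BLOCK)

saved_bits = plain_overhead - double_overhead
saved_gb_8b = saved_bits * 8.03e9 / 8 / 1e9  # for roughly 8.03B weights

print(f"overhead without double quant: {plain_overhead:.3f} bits/param")
print(f"overhead with double quant:    {double_overhead:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, about {saved_gb_8b:.2f} GB on an 8B model")
```

Under those assumptions the saving is a little under 0.4 bits per parameter, which works out to a few hundred megabytes on a model this size.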
Trainable parameters for this config (rank 16 on the four attention projections) come out to about 13.6 million, roughly 0.17% of the full model. That's the number that makes single-A100 fine-tuning tractable.
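You can verify the adapter count by hand, and it's worth doing once. Each LoRA-wrapped linear layer adds rank × (in_features + out_features) weights. The sketch below assumes the published Llama 3.1 8B layer shapes (check model.config for your own checkpoint), and lora_param_count is our own helper, not a PEFT function.

```python
# Published Llama 3.1 8B shapes (assumed here; confirm via model.config).
HIDDEN = 4096
KV_DIM = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336
LAYERS = 32

# module name -> (in_features, out_features)
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, target_modules, layers=LAYERS):
    """LoRA adds an (in x rank) and a (rank x out) matrix per wrapped module."""
    per_layer = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * layers


attention_only = lora_param_count(16, ["q_proj", "k_proj", "v_proj", "o_proj"])
all_linear = lora_param_count(8, list(SHAPES))

print(f"r=16, attention projections: {attention_only:,}")
print(f"r=8, all linear layers:      {all_linear:,}")
```

The count moves a lot with rank and target modules, so if print_trainable_parameters reports something you don't expect, the config you're running probably isn't the one you think it is.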
What Actually Goes Wrong in the Lab
Three failure modes show up repeatedly in our capstone reviews, and all three are boring in hindsight.
Padding side. Llama is a decoder model. Padding on the left causes subtle training instability because attention masks interact with positional embeddings in ways that don't crash your run; they just produce loss spikes around step 200 that are maddening to diagnose. Right-padding plus pad_token = eos_token fixes it. Set it before you start.
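Dataset formatting. SFTTrainer needs the conversation rendered in the model's chat template. With the trl version pinned above, a messages column in the standard role/content layout gets templated for you; any other structure has to be turned into a text column yourself. Students who pass raw lists in some other shape get either a silent failure or outputs that look coherent but aren't. When in doubt, run tokenizer.apply_chat_template on your dataset before training and read a few rendered examples.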
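Here's a minimal sketch of that formatting step. It assumes each row has a messages list of role/content dicts, and render_chat is a helper name we picked. The only library call it relies on is the tokenizer's apply_chat_template.

```python
def render_chat(example, tokenizer):
    """Turn one {'messages': [...]} row into a {'text': ...} row."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,               # return a string, not token ids
        add_generation_prompt=False,  # training rows already contain the assistant turns
    )
    return {"text": text}


# Usage with the objects from the walkthrough (not run here):
# dataset = dataset.map(render_chat, fn_kwargs={"tokenizer": tokenizer})
# print(dataset[0]["text"][:500])  # read one rendered example before training
```

If you go this route, point the trainer at the new column with dataset_text_field="text" in SFTConfig. Printing a rendered example takes ten seconds and catches most template mistakes.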
Evaluation data leakage. In our June 2025 cohort, three students trained on their entire dataset and then wondered why eval perplexity looked great but inference outputs were repetitive and hollow. They'd never split the data. Split first. Every time, without exception.
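The split itself is a few lines. This sketch uses only the standard library so the logic is visible. split_indices is a hypothetical helper, and the 10% eval fraction and seed are arbitrary choices.

```python
import random


def split_indices(n_rows, eval_fraction=0.1, seed=42):
    """Shuffle row indices once and carve off a held-out eval slice."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_eval = max(1, int(n_rows * eval_fraction))
    return indices[n_eval:], indices[:n_eval]


train_idx, eval_idx = split_indices(5000)

# The whole point: no row appears on both sides.
assert not set(train_idx) & set(eval_idx)
print(len(train_idx), len(eval_idx))
```

With a Hugging Face dataset you can skip the helper: dataset.train_test_split(test_size=0.1, seed=42) returns train and test splits, and SFTTrainer takes the held-out part as eval_dataset.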
The NLP & LLM Engineering course covers all of this in module 4, but honestly the lab is where it actually lands. Reading about padding tokens and watching your loss curve go sideways because of them are different experiences entirely.
The infrastructure constraint on 7B fine-tuning is basically solved. What's left is the configuration layer, and that's messier than any tutorial admits. Knowing which of the fifteen things you set up is the one causing your loss to plateau at 2.4 is still the hard part.