Transformer Attention Explained for Backend Engineers

Transformer Attention Is Why Your LLM Call Gets Slow at 8K Tokens

Last spring, a backend engineer in one of our cohorts filed a bug. His API calls to Claude were taking four seconds at 2K tokens and nearly eighteen seconds at 8K tokens. He thought it was a network issue. It wasn't. It was transformer attention doing exactly what it's supposed to do, and he had no mental model for why.

If you're calling an LLM API without understanding the self-attention mechanism underneath it, you're flying blind. Not in a poetic way. In a very concrete "I can't explain this latency spike to my team" way.

So let's fix that.

What the Self-Attention Mechanism Is Actually Computing

The foundational paper "Attention Is All You Need" (Vaswani et al., 2017) introduced a deceptively simple operation. For every token in your input, the model computes three vectors: a query, a key, and a value. These come from multiplying the token's embedding by three learned weight matrices (W_Q, W_K, W_V).

Then for each token, you compute a dot product between its query and the key of every token in the sequence (itself included), divide by the square root of the key dimension, run a softmax over the results, and use the resulting weights to take a weighted sum of the value vectors.

In code it looks like this:

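import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (..., N, d_k) queries and keys; V: (..., N, d_v) values.
    d_k = Q.size(-1)
    # Score every query against every key: an (..., N, N) matrix.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Normalize each query's row into weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors for each query.
    return torch.matmul(weights, V)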

That torch.matmul(Q, K.transpose(-2, -1)) is the line that should give you pause. If your sequence has N tokens, that operation produces an N×N matrix. Double the sequence length, quadruple the compute. That's not a bug in any particular implementation. It's the math.
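To put numbers on that, here is a minimal sketch that counts the entries in one head's score matrix. The function name and the 2-bytes-per-entry figure (fp16) are assumptions for illustration. Optimized kernels such as FlashAttention avoid storing the whole matrix at once, but the number of query-key pairs they have to compute is the same.

```python
def score_matrix_stats(n_tokens, bytes_per_entry=2):
    """Entries and bytes in one head's N x N attention score matrix.

    bytes_per_entry=2 assumes fp16 scores (an illustrative assumption).
    """
    entries = n_tokens * n_tokens
    return entries, entries * bytes_per_entry


for n in (2_048, 4_096, 8_192):
    entries, size = score_matrix_stats(n)
    print(f"N={n:>5}: {entries:>12,} entries, {size / 2**20:,.0f} MiB per head")
```

Each doubling of N multiplies both columns by four.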

This is why your 8K-token call doesn't cost four times your 2K-token call. Four times the tokens means sixteen times the attention operations. The feed-forward layers and layer norms scale linearly. Attention does not. End-to-end latency is a blend of the two, which is why the engineer in our cohort saw four seconds turn into eighteen: worse than the 4x that linear scaling alone would predict, but short of a full 16x. No amount of connection-pool tuning was going to fix it.
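A rough way to see how the linear and quadratic parts combine is to count FLOPs for one layer. The sketch below is a back-of-envelope estimate, not a profiler: it assumes a standard layer with a 4x feed-forward expansion, counts a multiply-add as 2 FLOPs, ignores causal masking and layer norms, and uses a hypothetical hidden size (d_model=4096) that is not taken from any specific model.

```python
def layer_flops(n_tokens, d_model):
    """Return (linear_flops, attention_flops) for one transformer layer.

    Assumes a 4x feed-forward expansion; d_model is a hypothetical
    hidden size chosen for illustration.
    """
    # Q, K, V, and output projections: four (N x d) @ (d x d) matmuls.
    projections = 4 * 2 * n_tokens * d_model * d_model
    # Feed-forward: (N x d) @ (d x 4d), then (N x 4d) @ (4d x d).
    feed_forward = 2 * 2 * n_tokens * d_model * 4 * d_model
    # Attention: Q @ K^T, then weights @ V, summed over all heads.
    attention = 2 * 2 * n_tokens * n_tokens * d_model
    return projections + feed_forward, attention


short_linear, short_attn = layer_flops(2_048, d_model=4_096)
long_linear, long_attn = layer_flops(8_192, d_model=4_096)

print(f"linear part grows    {long_linear / short_linear:.1f}x")
print(f"attention part grows {long_attn / short_attn:.1f}x")
print(f"total grows          {(long_linear + long_attn) / (short_linear + short_attn):.2f}x")
```

Under these assumptions the total lands between 4x and 16x, and the attention share keeps climbing as the sequence gets longer.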

Multi-Head Attention: Why One Set of Q, K, V Isn't Enough

Real transformer models don't run one attention operation. They run many in parallel. Multi-head attention splits the embedding dimension across several heads, each with its own Q, K, V projections, then concatenates the results.
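The split-and-concatenate step is easier to see with a toy vector. This is a minimal sketch with hypothetical helper names (split_heads, concat_heads) and a made-up d_model of 8. In a real model, attention runs on each slice between the two calls, and an output projection typically follows the concatenation.

```python
def split_heads(vector, n_heads):
    """Split one token's projected vector into n_heads equal slices."""
    if len(vector) % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = len(vector) // n_heads
    return [vector[i * d_head:(i + 1) * d_head] for i in range(n_heads)]


def concat_heads(heads):
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [x for head in heads for x in head]


q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy projected query, d_model = 8
heads = split_heads(q, n_heads=4)              # 4 heads, each with d_head = 2
print(heads)
print(concat_heads(heads) == q)                # the round trip restores the vector
```

Because the heads share the embedding dimension rather than each getting a full copy, adding heads does not by itself multiply the total projection size.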

The largest GPT-3 model runs 96 attention heads per layer. Each head learns to attend to different relationships. One might track subject-verb agreement across a long clause; another might resolve pronoun co-references fifty tokens back. The model isn't explicitly programmed to do those specific things. It learns them from data. A single head produces only one set of attention weights per token, so it can't track several kinds of dependency at once. That is the concrete reason multi-head attention exists, beyond "richer representations" in the abstract sense.

For you as an API caller, the practical consequence is that KV cache size scales with sequence length, the number of layers, and the number of key-value heads. When Anthropic or OpenAI prices a context window, KV cache memory is likely a large part of the underlying cost. A 200K-token context on Claude Sonnet 4.5 isn't cheap on their side either. They may be absorbing more of that cost than you might think.
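The KV cache arithmetic fits in one line. The sketch below uses a hypothetical configuration (32 layers, 32 key-value heads, head size 128, fp16 values). These numbers are placeholders, not the published or actual configuration of any Anthropic or OpenAI model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading 2 covers keys and values; bytes_per_value=2 assumes fp16.
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic concrete.
config = dict(n_layers=32, n_kv_heads=32, d_head=128)

for seq_len in (2_000, 8_000, 200_000):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>7,} tokens: {gib:6.2f} GiB of KV cache")
```

Unlike attention compute, the cache grows linearly with sequence length, but it has to stay in accelerator memory for the whole request, which is what makes long contexts expensive to serve.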

Why This Changes How You Should Use the API

Once you understand how LLMs work at the attention level, a few common API patterns start to look obviously wrong.

Stuffing the context window by default. I've seen teams dump entire codebases or document stores into every request because "bigger context means better answers". Sometimes that's true. More often you're paying quadratic attention costs for tokens that contribute almost nothing to the final output. Retrieval with pgvector or a RAG pipeline is frequently the right call. Our RAG & Vector DB lab environment exists specifically for building and benchmarking that tradeoff.

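Setting max_tokens too high. Output tokens consume attention too: every generated token attends over the whole prompt plus everything generated so far. The max_tokens setting is only a cap, so it costs nothing until a response actually runs long. But if your use case produces outputs under 300 tokens 95% of the time and you've set max_tokens to 4096, you're not getting better answers. You're leaving the door open for expensive edge cases, and those cases will find you.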
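To size that edge case, here is a minimal sketch that counts how many earlier positions the model attends to across a whole generation. The function name and the 1,000-token prompt are assumptions for illustration, and the count ignores each token's attention to itself.

```python
def decode_attention_reads(prompt_tokens, output_tokens):
    """Total earlier positions attended to while generating output_tokens.

    Step t (0-based) looks back over the prompt plus the t tokens already
    generated, so the total is T*P + T*(T-1)/2.
    """
    t = output_tokens
    return t * prompt_tokens + t * (t - 1) // 2


typical = decode_attention_reads(prompt_tokens=1_000, output_tokens=300)
runaway = decode_attention_reads(prompt_tokens=1_000, output_tokens=4_096)

print(f"300-token reply:   {typical:>12,} positions")
print(f"4,096-token reply: {runaway:>12,} positions")
print(f"ratio: {runaway / typical:.1f}x")
```

The runaway reply is about 14 times longer in tokens but does far more than 14 times the attention work, because the quadratic term in the output length takes over.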

Ignoring prefill versus decode latency. Prefill, which is processing your input tokens, is highly parallelizable on GPUs such as the A100 80GB. Decode, which is generating output one token at a time, is memory-bandwidth-bound and mostly sequential. A 6,000-token system prompt with a 50-token response will feel faster than a 500-token prompt with a 2,000-token response, even though the first request handles more than twice as many tokens in total. Four engineers in our most recent NLP & LLM Engineering office hours were confused about exactly this until we pulled up actual latency traces side by side.
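A two-rate model is enough to reason about this. The throughput figures below (5,000 prompt tokens per second for prefill, 50 output tokens per second for decode) are invented placeholders, not measurements from any provider. Swap in numbers from your own latency traces.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tokens_per_s=5_000.0,
                       decode_tokens_per_s=50.0):
    """Crude latency estimate: parallel prefill plus sequential decode.

    Both rates are hypothetical defaults; measure your own before relying
    on the result.
    """
    prefill_s = prompt_tokens / prefill_tokens_per_s
    decode_s = output_tokens / decode_tokens_per_s
    return prefill_s + decode_s


long_prompt = estimate_latency_s(prompt_tokens=6_000, output_tokens=50)
long_reply = estimate_latency_s(prompt_tokens=500, output_tokens=2_000)

print(f"6,000 in / 50 out:  {long_prompt:5.1f} s")
print(f"500 in / 2,000 out: {long_reply:5.1f} s")
```

The model is deliberately simplistic (it treats both rates as constant), but it captures why output length, not total token count, usually dominates what the user feels.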

The Intuition That Actually Sticks

Here's how I explain transformer attention to engineers who've never read the paper. Every token in your sequence gets to "ask a question" (query) of every other token. The other tokens "advertise what they know" (keys). The dot product measures how relevant each answer is. The value vectors carry the actual content that flows forward when relevance is high.

Not perfectly rigorous. Accurate enough to reason about production behavior.

When a model "loses track" of something mentioned early in a long conversation, it's often because the attention weights for those early tokens got diluted by everything that came after. The information is still in the context. It just isn't being weighted heavily. That's a testable, fixable thing once you have the mental model for it.
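Dilution falls straight out of the softmax. In this minimal sketch, one early token gets a higher raw score than every later token, and its weight still collapses as the context grows. The scores (2.0 versus 1.0) are made up for illustration; real attention scores vary per head and per layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_token_weight(n_later_tokens, early_score=2.0, later_score=1.0):
    """Attention weight on one early token competing with n later tokens."""
    weights = softmax([early_score] + [later_score] * n_later_tokens)
    return weights[0]


for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6,} later tokens: weight on early token = {early_token_weight(n):.4f}")
```

The early token never scores lower than its competitors here. It just gets outnumbered.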

Our NLP & LLM Engineering course works through the full attention math and then has students profile real inference runs on the LLM Inference (L4) lab environment. We watch the quadratic scaling show up in our own latency charts. Numbers on a graph hit differently than equations on a whiteboard.

Knowing the math won't make you a researcher. But it will make you the engineer on your team who can explain why the chatbot slowed down after someone doubled the system prompt. That engineer is worth having around.
