Inference Engineering
How generative-model inference actually works in production — from a single CUDA kernel up to multi-cloud autoscaling.
I can train a model. I can also stand one up behind an API and watch it fall over the moment two people use it at once. That gap — between "the weights exist" and "the weights serve a thousand users fast and cheap and reliably" — is the whole subject here.
I worked through Philip Kiely's Inference Engineering because I do robot inference and kept hitting walls I didn't have the vocabulary for. These are my notes from that read, rewritten in my own words. I'm not reproducing the book; I'm explaining the things to myself the way I'd explain them to a friend who knows ML but has never had to serve a model. If a sentence sounds like a textbook, I failed and I'll fix it.
The shape of the thing: we start from what inference even is, get the product framing straight, then go deep on where the bottlenecks live (a GPU has exactly two things you can run out of, and knowing which one decides everything). Then hardware, software, the applied-research techniques that actually move numbers, the per-modality wrinkles, and finally the production reality of containers and autoscaling and cold starts. Math renders where the book has math. Each chapter ends with the kind of self-check questions an infra interviewer would actually ask, because that's how I revise.
What inference even is
Training is where you learn the weights. Inference is where you make those weights earn their keep. Everyone obsesses over the first one. The second is where the money and the misery live.
Here's the thing that took me embarrassingly long to internalize: a trained model is just a big file of numbers. It does nothing. Inference is the act of running a forward pass over that file to turn an input into an output, and doing it fast enough, often enough, and cheaply enough that a real product can stand on top of it. For most of ML history this was a non-event. You trained an XGBoost model, you pickled it, you called .predict() on a CPU, and you went home. Classic ML inference is genuinely easy.
Generative inference is not easy, and pretending otherwise is how teams light money on fire. You cannot just grab the weights, rent some GPUs, and expect a fast, reliable service to fall out. A 70B-parameter model in half precision is 140 GB before you've served a single token. The model is autoregressive, so generating 500 tokens means 500 sequential forward passes, each one waiting on the last. The thing that makes it feel instant to a user (streaming tokens out one at a time) is the exact thing that makes it a nightmare to schedule across thousands of concurrent requests. None of that shows up when you're prototyping in a notebook.
The three layers that all have to work
The cleanest way I've found to think about the whole field is as three layers, stacked, each one useless without the others.
The runtime is the bottom of the stack: one model running on one instance, squeezed for every drop of speed. This is batching, KV caching, quantization, speculative decoding, parallelism, disaggregation — the techniques that show up again in Chapter 5 and are most of what people mean when they say "inference optimization."
The infrastructure layer takes that fast single box and makes it a service: autoscaling across a cluster, spreading load over regions and even clouds, surviving a node dying at 3am without anyone noticing. A blazing runtime behind flaky infra is a demo, not a product.
The tooling layer is the abstraction you hand the people building on top. Too black-box (weights in, API out) and you can't tune anything; too raw (here are some CUDA primitives, good luck) and nobody ships. The art is sitting in the middle.
When something is slow, the first question isn't "which technique do I add." It's "which layer is this." A slow single request is a runtime problem. A service that's fast when idle and slow under load is an infra problem. If your engineers can't express what they need, it's a tooling problem. Diagnosing the layer first saves you from optimizing a kernel when your autoscaler is the thing on fire.
Why open models turned this into a field
You could ignore all of this and just call a hosted API, and for a lot of products you should. So why does dedicated inference exist as a discipline? Three reasons, and they're the same three axes the whole book optimizes along.
- Latency. Shared APIs are tuned for aggregate throughput across everyone's traffic, not for your one request being fast. When you control the deployment you can tune for your latency budget.
- Availability. A public API gives you maybe two nines and a rate limit that moves without warning. A dedicated deployment you control can hit four-plus nines because the failure modes are yours to engineer around.
- Cost. At scale, running your own optimized deployment is frequently 80%+ cheaper than per-token API pricing, because you're not paying someone else's margin on idle capacity.
Open-weight models (Llama, Qwen, DeepSeek, Mistral) are what made this reachable. The moment you can legally hold the weights, all three levers become yours. That's the whole reason inference engineering is a job instead of a billing question.
Drills — Chapter 0
Why is classic ML inference (say, gradient-boosted trees) easy while LLM inference is hard, in one sentence each?
A model serves a single request in 200 ms but P90 latency under production load is 4 s. Which of the three layers do you investigate, and why?
Name a case where the right answer is "don't do dedicated inference at all."
Prerequisites: the product thinking before you optimize
Optimization is not "make the number go up." It's picking the least-bad tradeoff among things that fight each other. If you skip this chapter you'll spend a week shaving 30 ms off a latency that nobody's budget cared about.
Latency, throughput, quality — pick your sacrifice
There are three quantities you're always trading: latency (how fast one user gets their answer), throughput (how many tokens the whole system pushes per second), and quality (how good the outputs are). You do not get to maximize all three. Bigger batches lift throughput but hurt per-user latency. Quantization lifts both but can dent quality. The job is finding the point on that surface your product actually needs.
The analogy that stuck with me: an NFL athlete isn't the maximal human, they're the specialized one. A lineman optimized for a 40-yard dash would be worse at being a lineman. Same here. A deployment tuned to win a throughput benchmark is often a bad fit for a latency-sensitive chat product. You're building a specialist for one job, not chasing a leaderboard.
More constraints make this easier, not harder. "Make it fast" is unanswerable. "P90 time-to-first-token under 400 ms, at 50 requests/sec, for under $X per million tokens, on inputs that average 2k tokens" is a spec you can actually engineer toward. Vague requirements are how you end up optimizing the wrong axis.
Know your requirements
Before touching a GPU I want four things pinned down: the application interface (streaming chat? batch jobs? a tool-call inside an agent that blocks until it returns?), the latency budget (what does "too slow" mean to a user), the unit economics (cost per request / per user / per month that keeps the product viable), and the usage pattern (steady? spiky at business hours? a few giant requests among many tiny ones?). Every later decision keys off these.
One axis that confuses people: shared vs dedicated is not the same as open vs closed. Shared inference bills per token on capacity you share with strangers. Dedicated inference bills per GPU-hour on capacity that's yours. You can run an open model on a shared endpoint, or pay for a dedicated deployment of one. The pricing model and the model's license are two different choices.
Model selection
Evals first, always. Pick the smallest model that clears your quality bar on your task, not the one topping a generic leaderboard. Two tools change the calculus:
- Fine-tuning buys domain quality at a smaller size. A text-to-SQL fine-tune of a few-billion-parameter model can match a 100B+ general model on that one task, and it's dramatically cheaper to serve. Specialization beats raw scale when the task is narrow.
- Distillation trains a small student to mimic a big teacher's full probability distribution, not just its final answers — the soft targets carry more signal than the hard labels. Worth knowing: the small and large models in a released family are usually trained independently, not distilled from each other. DeepSeek-R1's Llama/Qwen distillations are the famous exception, not the rule.
The metrics — this is the vocabulary for the whole guide
Get these exactly right, because every chapter after this leans on them.
- TTFT (time to first token). How long until the user sees anything. Gated by prefill, which is compute-bound. This is the metric a streaming chat UI lives and dies on.
- TPS (tokens per second) is ambiguous and you must say which one you mean. Perceived TPS is one user's stream speed; total TPS is the system's aggregate. ITL (inter-token latency) is the cleaner per-user number: 10 ms ITL = 100 tokens/sec/user. Reading speed is ~10 tok/s, so anything past that feels instant to a human.
- Non-streamed calls (an agent's tool call that blocks until the full JSON is back) don't care about TTFT — measure total response time instead.
- Percentiles, not means. Latency is right-skewed: a few slow requests drag the average somewhere no real request lives. Report P50/P90/P99 (the 1-in-2, 1-in-10, 1-in-100 slowest). Good perf work pulls in the tail, not the mean.
- Inference-only vs end-to-end. If the model is fast but the user-visible request is slow, your problem is infra (network, queueing, pre/post-processing), not the model. Always measure both so you know which one to chase.
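To see why the mean hides the tail, here is a minimal sketch on made-up latencies. The numbers and the nearest-rank `percentile` helper are my own illustration, not something from the book; real monitoring stacks may interpolate percentiles differently.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in ms: most requests are fast, a few hit a slow path.
latencies_ms = [200] * 89 + [2000] * 9 + [10000] * 2

mean_ms = statistics.mean(latencies_ms)
p50, p90, p99 = (percentile(latencies_ms, p) for p in (50, 90, 99))

print(f"mean={mean_ms} ms  P50={p50} ms  P90={p90} ms  P99={p99} ms")
```

The mean lands at 558 ms, a latency no request actually experienced: the typical request took 200 ms, one in ten took at least 2 s, and the worst took 10 s.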
Drills — Chapter 1
A teammate reports "average latency is 600 ms, we're fine." Why might you still have a problem?
For an agent that makes blocking tool calls (no streaming), which metric matters and which is irrelevant?
Inference benchmarks look great but users say the app is sluggish. First hypothesis?
Why does adding constraints to a spec tend to produce a better deployment?
Models: where the bottlenecks live
This is the chapter that actually changed how I think. Everything before it is framing; here we get to the physics. The punchline you can tattoo on your arm: prefill is compute-bound, decode is memory-bound. Once that clicks, half of inference optimization stops being a grab-bag of tricks and becomes obvious.
Neural nets, fast
A linear layer is a matrix multiply plus a bias: \( y = Wx + b \). That's it. Stack two of them and you've gained nothing — \( W_2(W_1 x) = (W_2 W_1)x \) is just one bigger linear layer, because the composition of linear maps is linear. The thing that saves you is the nonlinearity between them: ReLU, SiLU/Swish, SwiGLU. Slip a nonlinear function between the matmuls and the layers stop collapsing, so depth starts buying you real expressive power, while staying (mostly) differentiable so you can still train it.
Framing-wise: modern LLMs are decoder-only. The BERT-style encoder-only models that ran NLP in 2019 are rare now (they survive in embeddings, Chapter 6). Encoder-decoder still shows up when the input is a different modality than the output — Whisper turning audio into text, for instance.
LLM mechanics
An LLM generates one token at a time, autoregressively: each new token is conditioned on everything before it. A few mechanics people gloss over and then get bitten by:
- Tokenization is not a neural net. It's a deterministic string ↔ integer map (subword units), vocab usually 100k+. No model, no GPU, just a lookup. It runs before and after the network.
- The chat template is the exact string format that merges system/user/assistant roles, tool definitions, and multimodal placeholders into one flat sequence. It has to be implemented byte-for-byte right; a wrong template silently tanks quality because the model sees a distribution it was never trained on.
- Context window = input tokens + any reasoning tokens the model generates + output tokens. Reasoning models eat into the same budget, which is easy to forget when you size a deployment.
Generation has two phases with completely different cost profiles:
- Prefill processes the entire input prompt in parallel in one shot, building the KV cache. One big batched forward pass.
- Decode is the autoregressive loop: one forward pass per output token, each waiting on the last, each reading the whole KV cache.
Each decode step produces a logit vector the length of the vocabulary. Normalize it to probabilities and sample. Temperature scales the logits before the softmax (higher = flatter = more random); top-k keeps the k most likely tokens; top-p keeps the smallest set whose mass exceeds p. Temperature 0 (or top-k 1) is greedy and deterministic. You can also bias specific logits or constrain decoding to a grammar — that's how structured JSON and reliable tool calls get enforced.
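The sampling knobs described above fit in a few lines. This is a sketch in plain Python, assuming the common order of temperature first, then top-k, then top-p; the function name and the order are my choices, and real engines do this on the GPU over a whole batch.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token id from raw logits using temperature, then top-k, then top-p."""
    if temperature == 0:
        # Greedy decoding: always the argmax, fully deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature divides the logits before the softmax (higher = flatter).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:  # smallest set whose mass reaches p
                break
        ranked = kept

    # random.choices treats the surviving probabilities as relative weights.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

Calling it with `temperature=0` or `top_k=1` always returns the most likely token, which is the determinism the text mentions. Logit biasing and grammar-constrained decoding would slot in as edits to `logits` before any of this runs.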
Reading a model off config.json
You can infer an enormous amount from the architecture string. Parse Qwen3MoeForCausalLM: family (Qwen), version (3), it's a mixture-of-experts (Moe), and it's a causal language model (ForCausalLM), meaning left-to-right generation under a causal attention mask, as opposed to a BERT-style masked (bidirectional) LM. One architecture covers many sizes, fine-tunes, and LoRAs, which is exactly why a single optimized runtime can serve a whole family of variants.
The block structure: an embedding layer maps token IDs to vectors, then N identical transformer blocks (each = attention + a feed-forward MLP + normalization), then an LM head projects back to vocab logits. By parameter count, the FFN dominates, attention is second, and the norms and activations are rounding error. Worth remembering when you think about where the memory traffic goes.
Attention, properly
Scaled dot-product attention is the core operation:
$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V $$
Each token emits a query \(Q\), a key \(K\), and a value \(V\). The query dotted against every key gives attention scores; softmax turns them into weights; the weighted sum of values is the output. Multi-head attention runs several of these in parallel, each on its own slice of the vector. LLMs use self-attention with a causal mask (a token may attend only to itself and earlier tokens); image and multimodal models also use cross-attention (one sequence attending to another).
The scary fact: \(QK^{\top}\) is \(N \times N\) for sequence length \(N\), so attention is naively \(O(N^2)\). The thing that rescues you in practice is the KV cache. During decode, the keys and values for all previous tokens never change, so you store them and look them up instead of recomputing. The cache is built during prefill and read and grown by one on every decode step. With it, each decode step is linear in sequence length rather than quadratic — the single most important systems trick in LLM serving.
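A toy version of the cache makes the build-then-read pattern concrete. This is a sketch under heavy simplifying assumptions: one head, plain Python lists, and identity projections (the token vector doubles as its own query, key, and value), none of which is how a real engine stores things.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for ONE query over all given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # stabilise the softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    out = [0.0] * len(values[0])
    for e, v in zip(exps, values):
        for j, vj in enumerate(v):
            out[j] += (e / total) * vj
    return out

class KVCache:
    """Keys and values of every token seen so far; only ever appended to."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(cache, x):
    """Add the new token's K/V to the cache, then attend over the whole cache."""
    q, k, v = x, x, x  # toy assumption: identity projections
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = KVCache()
for x in prompt:  # prefill builds the cache (one batched pass in a real engine)
    cache.append(x, x)
cached_out = decode_step(cache, new_token)

# Without a cache, every step would first rebuild K and V for all earlier tokens.
everything = prompt + [new_token]
recomputed_out = attend(new_token, everything, everything)
```

`cached_out` and `recomputed_out` are the same vector. The cached path only did work for the one new token plus a read over the stored keys and values, which is the linear-per-step cost the paragraph describes.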
MoE changes the accounting: a router picks a few experts per token out of many, so the model has far more total parameters than active parameters per token. That decouples capacity from per-token compute and rewrites the throughput story (and the parallelism story in Chapter 5).
The bottleneck math — don't skip this, it's the best part
A GPU has exactly two resources you can run out of: compute (floating-point ops per second) and memory bandwidth (bytes per second moved between HBM and the compute units). For any given kernel, one of them saturates first while the other sits partly idle. The saturated one is your bottleneck, and which one it is determines every optimization decision.
Each chip has an ops:byte ratio — its break-even arithmetic intensity. For an H100 in FP16:
$$ \frac{989 \times 10^{12}\ \text{FLOP/s}}{3.35 \times 10^{12}\ \text{bytes/s}} \approx 295\ \text{ops/byte} $$
So the H100 can do ~295 floating-point operations in the time it takes to move one byte. Arithmetic intensity is the matching property of your algorithm: total compute work ÷ total memory traffic, per kernel. Compare the two:
- Algorithm intensity below 295 → you finish the math before the bytes arrive → memory-bound.
- Algorithm intensity above 295 → the bytes are there but the math lags → compute-bound.
The roofline model draws this: a diagonal bandwidth ceiling that rises with intensity, meeting a flat compute ceiling at the ridge point. Left of the ridge you're memory-bound; right of it, compute-bound.
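The roofline is small enough to write down as a function. A minimal sketch, using the H100 FP16 figures quoted above; the function and constant names are mine.

```python
H100_FP16_FLOPS = 989e12        # peak Tensor-core FLOP/s, as quoted above
H100_HBM_BYTES_PER_S = 3.35e12  # HBM bandwidth in bytes/s, as quoted above

def ridge_point(peak_flops, mem_bandwidth):
    """The chip's ops:byte ratio, where the two ceilings meet."""
    return peak_flops / mem_bandwidth

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: the lower of the flat compute ceiling and the rising bandwidth ceiling."""
    return min(peak_flops, intensity * mem_bandwidth)

def bound_by(intensity, peak_flops, mem_bandwidth):
    """Which resource saturates first for a kernel of this arithmetic intensity."""
    if intensity < ridge_point(peak_flops, mem_bandwidth):
        return "memory"
    return "compute"

for intensity in (1, 62, 295, 1000):
    flops = attainable_flops(intensity, H100_FP16_FLOPS, H100_HBM_BYTES_PER_S)
    side = bound_by(intensity, H100_FP16_FLOPS, H100_HBM_BYTES_PER_S)
    print(f"{intensity:>5} ops/byte -> {flops / 1e12:7.1f} TFLOP/s attainable ({side}-bound)")
```

At an intensity of 1 op/byte the best you can do is about 3.35 TFLOP/s out of a 989 TFLOP/s chip, which is the gap the rest of this chapter keeps running into.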
Now the headline result. Prefill loads each weight once and then does big matmuls against the whole prompt at once — lots of compute per byte loaded, high intensity, lands right of the ridge → compute-bound → it sets TTFT. Decode reloads the entire model's weights from HBM to generate one token via skinny vector-matrix mults — almost no compute per byte, low intensity, lands left of the ridge → memory-bound → it sets TPS. Decode spends its life waiting on memory while the tensor cores twiddle their thumbs.
To make that concrete, work out the arithmetic intensity of attention computed the naive way, as three separate kernels. Take \(d = 128\), \(N = 4096\), FP16 (2 bytes/element). The tensors: \(Q, K, V\) are \(N \times d\); the score and probability matrices \(S, P\) are \(N \times N\); output \(O\) is \(N \times d\). A \(4096 \times 4096\) FP16 matrix is 32 MiB, so those \(N\times N\) intermediates dominate the traffic. Walk each line of attention as read → compute → write, sum the bytes for traffic and the FLOPs for work, and the ratio comes out to roughly 62 ops/byte. That's far under the H100's 295, so unfused attention is solidly memory-bound, because almost all of its traffic is the score matrix round-tripping through HBM. This is an intuition-building exercise, not something you compute daily, but doing it once makes the roofline stop being abstract.
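The bookkeeping for that worked example can be sketched as below. I'm assuming three unfused kernels (scores, softmax, weighted sum), counting a multiply-add as 2 FLOPs, and charging softmax about 3 ops per element; that last figure is my own rough guess and barely moves the result.

```python
def naive_attention_intensity(n, d, bytes_per_elem=2, softmax_ops_per_elem=3):
    """Arithmetic intensity (FLOPs per byte of HBM traffic) of unfused attention."""
    small = n * d * bytes_per_elem  # one of Q, K, V, O
    big = n * n * bytes_per_elem    # one of S, P

    # Kernel 1: S = Q K^T       reads Q and K, writes S
    flops = 2 * n * n * d
    traffic = 2 * small + big
    # Kernel 2: P = softmax(S)  reads S, writes P
    flops += softmax_ops_per_elem * n * n
    traffic += 2 * big
    # Kernel 3: O = P V         reads P and V, writes O
    flops += 2 * n * n * d
    traffic += big + 2 * small

    return flops / traffic

intensity = naive_attention_intensity(n=4096, d=128)
print(f"{intensity:.1f} ops/byte")
```

With these inputs the traffic is 132 MiB, of which 128 MiB is the two \(N \times N\) matrices being written and read back, and the intensity comes out a little above 62 ops/byte.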
Two consequences fall straight out. Batching makes decode less memory-bound: if you load the weights once and run many sequences' decode steps against them together, you amortize the byte cost over more compute, pushing intensity up toward the ridge. And image/video generation is compute-bound, the opposite of LLM decode — which is why its optimization playbook differs.
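The batching claim is easy to check on a single weight matrix. A sketch, assuming one FP16 linear layer of a made-up size (8192 × 8192) and counting only weight and activation traffic; it deliberately ignores KV-cache reads, which grow with every sequence in the batch and do not amortize the same way.

```python
def decode_matmul_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for one linear layer applied to `batch` decode tokens at once."""
    flops = 2 * batch * d_in * d_out                          # multiply-add = 2 FLOPs
    weight_bytes = d_in * d_out * bytes_per_elem              # loaded once per step
    activation_bytes = batch * (d_in + d_out) * bytes_per_elem  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)

for batch in (1, 8, 64, 512):
    print(f"batch {batch:>3}: {decode_matmul_intensity(batch, 8192, 8192):6.1f} ops/byte")
```

At batch 1 the intensity is about 1 op/byte, hopelessly memory-bound. It climbs almost one-for-one with batch size, so under these assumptions it takes a batch in the hundreds before a weight matmul reaches the H100's ridge at ~295, and in practice KV-cache memory and latency targets usually stop you first.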
Image and video generation (lighter touch)
Diffusion models generate by iterative denoising: start from pure noise in a compressed latent space and, over many steps, subtract predicted noise until a coherent latent emerges, then decode it to pixels. An SDXL-era pipeline is text encoder → denoiser (UNet) → VAE decoder. Modern ones are heavier: Qwen-Image uses a full VLM as the text encoder, a ~20B denoiser, and a dual VAE. Few-step models (≤8 steps via latent consistency or distillation) cut 80–90% of the work for some quality loss.
Video adds a time axis: the latent is over X, Y, and T. You hold the whole clip and denoise every frame each step. Do it frame-by-frame instead and errors accumulate down the timeline (Self-Forcing and friends fight this). In practice that means batch size 1 on a full 8-GPU node, ~50 steps, enormous latents. The mental model I keep: video gen in 2025 feels like LLMs in 2023 — expensive, fast-moving, and about to get an order of magnitude cheaper.
Optimizing attention
Two roads. The first keeps the math identical and just runs it smarter (lossless); the second changes the math (lossy-ish, trades a little quality for a lot of speed).
Better implementations (lossless):
- FlashAttention fuses the attention kernels so the giant \(N \times N\) score matrix never gets written to HBM — it's computed in tiles in fast on-chip memory, killing redundant reads and writes. It's hardware-specific (H100 and B200 want different code) and shines for compute-bound prefill and video.
- PagedAttention stores the KV cache in non-contiguous fixed-size pages addressed through a lookup table, the way an OS pages virtual memory. That kills the fragmentation you'd otherwise get from variable-length sequences and lets you pack far more concurrent requests into the same VRAM. (This is vLLM's origin trick.)
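The paging idea is mostly bookkeeping, so here is a toy allocator that shows just the lookup-table part. The class and method names are hypothetical and this is not vLLM's actual data structure; it tracks which physical page each sequence's tokens land in and nothing else.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to a list of fixed-size physical pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> physical page ids, in logical order
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token. Returns (physical_page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:  # no page yet, or the last one is full
            if not self.free_pages:
                raise MemoryError("out of KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=4, page_size=2)
slots = [alloc.append_token(s) for s in ("a", "a", "b", "a")]
```

After those four appends, sequence `a` owns pages `[3, 1]`: logically contiguous, physically scattered, with `b` sitting in between. Nothing had to be reserved up front for the maximum possible length, which is where the extra concurrency comes from.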
Better algorithms (lossy-ish): sliding-window attention drops \(O(N^2)\) to \(O(Nw)\) by only attending within a window \(w\); gated, linear, and compressed variants approximate the full thing; multi-head latent attention (MLA) shrinks the KV cache itself. And you can leave attention behind entirely: Mamba and other SSMs carry a recurrent state instead of an all-pairs comparison, and hybrids like Nemotron Nano interleave SSM and transformer layers to get most of the quality at a fraction of the KV cost.
Drills — Chapter 2
State why prefill and decode land on opposite sides of the roofline.
The KV cache turns attention from \(O(N^2)\) to linear per step. What exactly is cached, and when is it built vs. used?
You compute an algorithm's arithmetic intensity as 62 ops/byte on a chip whose ops:byte is 295. Bound by what? How might you move it?
FlashAttention and PagedAttention are both "lossless." What does each actually optimize, and are they competing?
Why does batching reduce decode's memory-bound-ness, and where's the limit?
Hardware: the GPU mental model
You don't need to write CUDA to do inference well, but you do need a true picture of the machine, or the roofline stays a cartoon. The one-liner: a CPU is built to do one thing after another quickly; a GPU is built to do the same thing to thousands of numbers at once. Matmuls are exactly that shape, which is why they live on GPUs.
Compute units
A GPU is a grid of streaming multiprocessors (SMs), and each SM packs three kinds of cores:
- CUDA cores — scalar/vector arithmetic, the general-purpose workhorses.
- Tensor cores — dedicated matrix-multiply-accumulate (MMA) units. This is the part that matters for inference; when a spec sheet quotes "989 TFLOP/s FP16," it means Tensor-core FLOPs. Measure inference compute here, not in CUDA-core numbers.
- SFUs (special function units): transcendentals like sin/cos/exp/log, which is where the exp in your softmax actually runs.
Memory hierarchy
Memory is a pyramid: tiny-and-instant at the top, huge-and-slow at the bottom. Registers → shared memory / L1 (per-SM, on-chip) → L2 (shared across SMs) → HBM (VRAM), the big pool off-chip. Every level up the pyramid trades capacity for bandwidth. The reason decode is memory-bound is right here: those model weights live in HBM, and HBM bandwidth — fast as it is in absolute terms — is the wall decode keeps slamming into. FlashAttention is fast precisely because it does its work up in SRAM instead of round-tripping to HBM.
Generations and SKUs
The naming is a mess, so here's the map. NVIDIA datacenter generations: Hopper (H100, H200), Ada Lovelace (L4, L40S, plus the RTX line), Blackwell (B200, GB200), with Rubin next. The Grace and Vera CPUs pair with GPUs into "superchips" (Grace-Hopper, Grace-Blackwell) sharing fast coherent memory. Disambiguate the tiers: datacenter (H100, B200), workstation (RTX Pro 6000), and consumer (RTX 5090) cards differ enormously in memory, interconnect, and whether you're even licensed to run them in a datacenter.
Instances and interconnects
Real deployments are multi-GPU nodes, and how those GPUs talk decides your parallelism strategy. NVLink / NVSwitch connect GPUs within a node at high bandwidth; InfiniBand connects nodes to each other, slower. Both are a fraction of a single GPU's VRAM bandwidth — and that gap is the entire reason multi-GPU inference has to be topology-aware. Moving data between GPUs is expensive relative to feeding one GPU, so you design to minimize it.
MIG (multi-instance GPU) goes the other way: slice one big GPU into isolated hardware partitions, each with its own memory and compute, so a small model that can't saturate an H100 gets a right-sized slice instead of wasting the whole card.
A few orientation notes for the rest of the landscape. Cloud vs on-prem is a capital-vs-flexibility call. Among clouds, the hyperscalers (AWS/GCP/Azure) sell everything; the neoclouds (CoreWeave, Nebius, and friends) specialize in GPUs and often win on price and availability. Non-NVIDIA accelerators exist — Google TPUs, AWS Trainium/Inferentia — with their own software stacks, and at the small end there's genuine local inference on desktops and phones. All real, all mostly out of scope for a GPU-serving deep dive.
Drills — Chapter 3
Why do we quote inference compute in Tensor-core FLOPs rather than CUDA-core FLOPs?
NVLink and InfiniBand are both "fast." Why does the distinction dominate multi-GPU design?
You're serving a 3B model that uses 15% of an H100. What hardware feature helps, and what's the tradeoff?
Software: from CUDA up to inference engines
Almost nobody writes the kernels. What you actually do is pick the right engine and configure it well. But you can't pick well without knowing what's underneath, so we climb the ladder from kernels to orchestrators.
CUDA and kernels
A kernel is a function that runs on the GPU. Two ideas matter at this level even if you never write one. Kernel selection: for a given matmul there are many implementations, and the fastest depends on the exact shape, dtype, and chip — a good engine picks the right one for your situation. Kernel fusion: combine several ops into one kernel so intermediate results stay in fast on-chip memory instead of bouncing to HBM and back. Fusing an activation into the preceding matmul, for example, saves a full HBM round-trip. Cutting round-trips is the whole game for memory-bound work.
Frameworks and formats
PyTorch is the substrate everything sits on. Weights ship as files: safetensors (safe, the default now) over pickle (can execute arbitrary code on load — avoid), and GGUF for the llama.cpp/local world. For graph-compiled paths there's ONNX Runtime and TensorRT; for quick model and pipeline access there's Hugging Face Transformers and Diffusers. These are the building blocks; the engines below wrap them into something you'd actually serve.
Inference engines — compare them honestly
This is where most of your leverage is. The honest comparison:
| Engine | What it's best at | The catch |
|---|---|---|
| vLLM | Origin of PagedAttention; broad model support; fast to stand up; the safe first choice for new or fragmented architectures, including VLMs. | Not always the absolute fastest once a model is stable and you'd compile it. |
| SGLang | Structured and agentic workloads; RadixAttention reuses shared prefixes across requests aggressively. | Smaller ecosystem; shines most when your traffic has heavy prefix sharing. |
| TensorRT-LLM | Top-end performance via a hardware-specific compile step. A config-and-benchmark workflow. | The engine build takes minutes and is pinned to an exact GPU/CUDA combo — you must cache the built engine and rebuild on any change. |
| NVIDIA Dynamo | Orchestration above the engines: large-scale distributed serving and disaggregation. | It's a layer, not a single-box engine — overkill until you're at real scale (sets up Ch 5). |
Standing one up is genuinely a one-liner, which is part of why vLLM is the default starting point:
```bash
vllm serve Qwen/Qwen3-8B \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching
```
Every flag there is a Chapter 5 technique in disguise: tensor parallelism, context length (KV budget), KV-cache quantization, prefix caching. The engine is where the theory becomes a config value.
Benchmarking vs profiling
These get used interchangeably and they're not the same thing. Benchmarking produces a number: "P90 TTFT is 350 ms at 40 req/s." Profiling tells you why that number is what it is, op by op (PyTorch Profiler for the framework view, Nsight Systems for the timeline, Nsight Compute for a single kernel). Most engineers only ever benchmark, and that's correct — you reach for a profiler when you're writing kernels, contributing to vLLM/SGLang, or bringing up a brand-new modality.
Benchmarks lie when they're unrealistic. Replay real traffic shapes (input/output length distributions, arrival bursts), separate warm from cold runs, and fix your seeds. A number from a single fixed-length prompt at concurrency 1 will look beautiful and tell you nothing about production.
Drills — Chapter 4
You're bringing up a freshly released MoE VLM nobody has optimized yet. Which engine, and why not TensorRT-LLM on day one?
What does kernel fusion buy you, in roofline terms?
When is profiling (not benchmarking) the right tool?
Techniques: the applied-research chapter
This is the fun part and the part with the most rope to hang yourself. Two framing rules carry the whole chapter: more constraints unlock more perf, and more traffic makes more techniques worth their complexity. You don't reach for tensor parallelism across eight GPUs or dynamic disaggregation to serve a hundred requests a day.
And the techniques interact. Some are symbiotic — KV-cache quantization frees memory that makes disaggregation cheaper. Some conflict — bigger batches eat the spare compute that speculative decoding needs to do its job. There's a Baseten story about grinding through 77 configurations to land a 2× TPS win, and that's the honest texture of this work: patient, empirical, one variable at a time. There's no config that's optimal in general, only optimal for your spec.
Quantization
Drop the numeric precision of the model: BF16/FP16 down to FP8, even FP4. The payoff is double-barreled. Prefill gets roughly 2× the Tensor-core FLOPs at lower precision (it's compute-bound, so that's free speed). Decode effectively doubles its memory bandwidth, because each weight is half the bytes to move (it's memory-bound, so that's exactly the resource it was short on). Real-world gains run ~30–50% per precision level, not a clean 2× — there's always overhead.
The risk is compounding precision error. Think of \(\pi\): keep it as 3.14159 and \(\pi^3\) is basically right; round to 3.14 and the error grows when you cube it; round to 3 and \(\pi^3 \approx 27\) versus the true ~31 is a disaster. Low-precision errors compound the same way as they flow through layers and, worse, token-to-token through the KV cache.
Number formats matter more than the bit count suggests. INT8 vs FP8 (the E4M3 and E5M2 variants trade mantissa for dynamic range); FP4 and MXFP4; microscaling (MX) formats that attach shared scales to small blocks. And the scale granularity is a dial: per-tensor (coarse, cheap) → per-channel → per-group (fine, accurate). Finer scales preserve more dynamic range at a little overhead.
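Scale granularity is easier to believe once you watch one outlier wreck a per-tensor scale. A sketch using symmetric INT8 round-to-nearest on a made-up weight vector; I'm using integers rather than FP8 only because they are simple to emulate in plain Python.

```python
def quantize_dequantize(values, group_size=None, bits=8):
    """Symmetric round-to-nearest quantization, then back to floats.

    group_size=None means a single scale for the whole tensor (per-tensor).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    group_size = group_size or len(values)
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax
        if scale == 0:
            scale = 1.0  # all-zero group: any scale works
        out.extend(round(v / scale) * scale for v in group)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Made-up weights: seven small values and one large outlier.
weights = [0.01, -0.02, 0.015, 0.005, 0.03, -0.01, 0.02, 50.0]

per_tensor_err = mean_abs_error(weights, quantize_dequantize(weights))
per_group_err = mean_abs_error(weights, quantize_dequantize(weights, group_size=4))
print(f"per-tensor error {per_tensor_err:.5f}, per-group error {per_group_err:.5f}")
```

With one scale for everything, the outlier sets the step size and every small weight rounds to zero. With groups of four, the damage is confined to the group that contains the outlier and the other group keeps its resolution.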
So there's a risk gradient: weights and activations tolerate low precision well; the KV cache is moderately sensitive; attention is the danger zone — it's sensitive to dynamic range and its errors compound token-to-token, so you generally keep the softmax in full precision. Approaches range from weights-only (safest) to weight-plus-activation (more gain, more risk), and you usually leave the input and output layers at original precision because they're disproportionately sensitive.
Measure quality like you mean it: perplexity for a cheap signal, real benchmarks (MMLU, SWE-bench) for capability, and custom evals on your task for what actually matters. The target is a quality delta indistinguishable from noise. Quantization is a dial, not a switch — and if you genuinely cannot risk any quality, note that everything else in this chapter is lossless.
Speculative decoding
Remember decode's idle tensor cores? Speculative decoding spends them. The idea: a cheap speculator drafts several tokens ahead, then the real target model validates the whole draft in a single parallel forward pass, accepts the longest correct prefix, and tacks on its own next token. Best case you emit \(N+1\) tokens for the price of one decode step. The analogy that makes it click: a sudoku is hard to solve but trivial to check — drafting is the guess, validation is the cheap check.
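The draft-then-validate loop looks like this in its simplest form. A sketch assuming greedy decoding and toy "models" that are plain functions from a token list to the next token; `draft_next` and `target_next` are hypothetical stand-ins, and sampled decoding uses a probabilistic accept/reject rule that is not shown here.

```python
def speculative_step(context, draft_next, target_next, draft_len):
    """One speculative step under greedy decoding. Returns the tokens emitted."""
    # 1. The cheap speculator drafts draft_len tokens autoregressively.
    draft = []
    for _ in range(draft_len):
        draft.append(draft_next(context + draft))

    # 2. The target checks every drafted position. A real engine gets all of
    #    these answers from ONE batched forward pass; the loop only mimics that.
    emitted = []
    for token in draft:
        expected = target_next(context + emitted)
        if token != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return emitted
        emitted.append(token)

    # 3. Whole draft accepted: the same pass also yields one bonus token.
    emitted.append(target_next(context + emitted))
    return emitted

# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

fully_accepted = speculative_step([1], draft_next, target_next, draft_len=2)
partly_rejected = speculative_step([1], draft_next, target_next, draft_len=4)
```

The first call emits `[2, 3, 4]`, three tokens from a draft of two, which is the \(N+1\) best case. The second emits `[2, 3, 4, 5]`: the fourth drafted token was wrong, so it is replaced by the target's own choice and nothing after it would have survived. Either way the output is exactly what the target alone would have produced.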
Crucial property: it only helps TPS / ITL, never TTFT (prefill is unchanged). Three levers govern whether it's a win: draft cost, draft length, and acceptance rate. Acceptance is high for the first drafted token and decays the deeper you draft, and one rejection throws away the rest of that draft. Higher temperature hurts acceptance (more entropy = harder to guess), subject matter matters, and — this is the gotcha — you turn it off at high batch sizes, because then there's no spare compute to spend; the tensor cores are already busy.
The algorithm zoo, weakest to strongest:
- Draft-target: a small off-the-shelf model drafts. Simple, but CPU round-trips and mediocre acceptance limit it.
- Medusa: bolt 2–4 extra decoder heads onto the target to draft in parallel. Cheap, but limited draft depth.
- EAGLE: a purpose-built draft module that ingests the target's hidden states (mid/late layer), under 1B params, drafts up to ~8 tokens with high acceptance, and fuses into the same module so there's no CPU round-trip. The current go-to when you can afford to train the drafter.
- N-gram / Lookahead: no draft model at all — build an n-gram dictionary from the context during prefill and match prefixes to suffixes. Long drafts, and it only wins when output closely echoes input (code completion), where it actually beats EAGLE.
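The n-gram variant is simple enough to sketch whole. This is a minimal version under my own assumptions (fixed n-gram size, most recent match wins, index rebuilt on every call for brevity); real engines maintain the dictionary incrementally, and the function name is hypothetical.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose up to max_draft tokens by matching the last n tokens earlier in the context."""
    if len(tokens) <= n:
        return []
    # Map each n-gram to the position right after its most recent occurrence.
    # The final n-gram is skipped because nothing follows it yet.
    index = {}
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])] = i + n
    start = index.get(tuple(tokens[-n:]))
    return [] if start is None else tokens[start:start + max_draft]


# Code-like context where the output is about to echo the input.
context = ["def", "add", "(", "a", ",", "b", ")", ":", "\n", "def", "add", "("]
draft = ngram_draft(context)
```

The draft still goes through the same target validation as any other speculator, so a bad match costs a wasted draft, not a wrong output.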
Caching / prefix caching
The KV cache isn't just reusable within one request — it's reusable across requests that share a prefix. Two prompts that both start "You are a helpful assistant… The weather in " share all the KV up to the first differing token, so the second request skips recomputing that prefill entirely. The shared prefix ends at the first token that differs, which has a sharp practical implication: put the variable stuff last. Stable system prompt and tools first, user-specific text at the end, and you maximize cache hits. That's a context-engineering decision with a real latency payoff.
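The "variable stuff last" rule falls straight out of how the shared prefix is measured. A tiny sketch with made-up token lists; real engines match on token IDs and usually in fixed-size blocks, which I'm ignoring here.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = ["You", "are", "a", "helpful", "assistant", "."]
alice = ["Alice", ":", "weather", "in", "Oslo", "?"]
bob = ["Bob", ":", "weather", "in", "Lima", "?"]

# Stable content first: the whole system prompt is reusable.
stable_first = shared_prefix_len(system + alice, system + bob)

# User text first: the very first token differs, so nothing is reusable.
variable_first = shared_prefix_len(alice + system, bob + system)
```

Same tokens, same total length, and the only difference is ordering: six reusable tokens in one layout and zero in the other.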
The big winners are long system prompts, agent scaffolds, RAG context, code context, and multi-turn chat — anywhere a large stable chunk leads every request. This is literally why APIs bill cache-hit tokens cheaper. At the infra layer it pairs with cache-aware routing: send a request to the replica that already holds its prefix, instead of a random one that would recompute it. For prompts that outgrow VRAM there's long-context handling — offloading cold KV to CPU/host memory and paging it back.
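Cache-aware routing can be sketched as "longest cached prefix wins, lowest load breaks ties". The replica table layout, the field names, and the tie-break rule are my assumptions; a production router would track prefixes with something more compact than raw token lists.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt, replicas):
    """Choose the replica holding the longest cached prefix of prompt.

    replicas maps a name to {"cached": [token lists], "load": in-flight count}.
    """
    def score(name):
        info = replicas[name]
        best = max((shared_prefix_len(prompt, c) for c in info["cached"]),
                   default=0)
        return (best, -info["load"])   # longer prefix first, then lighter load
    return max(replicas, key=score)


replicas = {
    "a": {"cached": [[1, 2, 3, 4]], "load": 5},
    "b": {"cached": [[1, 2, 9]], "load": 0},
    "c": {"cached": [], "load": 1},
}
warm_choice = pick_replica([1, 2, 3, 7], replicas)   # "a" holds 3 shared tokens
cold_choice = pick_replica([8, 8], replicas)         # no cache hit anywhere
```

Note the trade-off baked into the score: replica "a" wins the first request despite being the busiest, so a real router also needs a load cap to stop hot prefixes from piling onto one replica.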
Model parallelism
First, sizing. A rough floor for memory is bytes per parameter × parameter count for the weights, plus an allowance for the KV cache; divide that by per-GPU VRAM and round up to an instance size to get a GPU count. E.g. a 70B model in FP16 is ~140 GB of weights plus KV headroom, so it won't fit one 80 GB card — you need multiple, and often more than the minimum to hold bigger KV or hit a latency target. The thing that makes this hard is the same gap from Chapter 3: inter-GPU communication is far slower than VRAM, so you choose a split that minimizes crossing. Three forms:
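Here's that sizing arithmetic as a sketch. Two assumptions of mine: the KV allowance is modeled as a flat multiplier on the weights (`kv_headroom`, 1.2 by default, a made-up number), and instances come in 1, 2, 4, or 8 GPUs.

```python
import math

INSTANCE_SIZES = (1, 2, 4, 8)   # assumed GPUs-per-instance options


def min_gpus(params_billion, bytes_per_param, gpu_mem_gb, kv_headroom=1.2):
    """Smallest instance size whose combined VRAM holds weights plus KV headroom."""
    weights_gb = params_billion * bytes_per_param   # 1B params x 1 byte = 1 GB
    needed_gb = weights_gb * kv_headroom
    raw = math.ceil(needed_gb / gpu_mem_gb)
    for size in INSTANCE_SIZES:
        if size >= raw:
            return size
    raise ValueError("does not fit one instance; needs multi-node parallelism")


fp16_70b = min_gpus(70, 2, 80)   # 140 GB weights -> 168 GB -> 3 cards -> 4-GPU instance
fp8_70b = min_gpus(70, 1, 80)    # 70 GB weights -> 84 GB -> 2 cards
```

This is a floor, not a recommendation. A latency target or a long-context workload can push you past it even when the weights fit.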
- Pipeline parallelism (PP) — split the layers across GPUs, like an assembly line. Cheap on communication but plagued by pipeline bubbles (GPUs idle waiting for the stage before them), so latency and utilization suffer. Mostly a multi-node tool.
- Tensor parallelism (TP) — split each layer's tensors across GPUs. Needs an all-reduce sync every layer, which is chatty, so it wants the fast NVLink fabric inside one node. Great intra-node latency; your default. Doesn't cross node boundaries gracefully.
- Expert parallelism (EP) — for MoE, shard whole experts across GPUs. Communication is just token routing, not a per-layer all-reduce, so it's lighter and scales across nodes. Throughput-oriented. EP8 over 8 GPUs with 128 experts puts 16 experts on each.
In practice you mix: TP for the dense attention layers, EP for the sparse MoE layer. Across nodes over InfiniBand you keep the chatty part local — TP within a node, PP across nodes (e.g. TP8PP2) for dense models, or EP16 for MoE. TP8PP2 minimizes per-user latency; EP16 maximizes throughput. And often the right answer isn't naive multi-node at all but horizontal replicas or disaggregation.
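The layout shorthand is easy to check mechanically. A small sketch, assuming the notation is just degree-suffixed tags glued together; the parser and helper names are hypothetical, and GPU counting is only handled for TP×PP or pure-EP layouts, since mixed TP+EP reuses the same GPUs.

```python
import re


def parse_layout(spec):
    """Turn a string like 'TP8PP2' into {'TP': 8, 'PP': 2}."""
    parts = re.findall(r"(TP|PP|EP)(\d+)", spec)
    if not parts or "".join(k + v for k, v in parts) != spec:
        raise ValueError(f"unrecognized layout: {spec!r}")
    return {kind: int(degree) for kind, degree in parts}


def gpu_count(layout):
    """GPUs needed for a TP x PP layout or a pure-EP layout."""
    if "EP" in layout and len(layout) > 1:
        raise ValueError("mixed EP layouts share GPUs; count them by hand")
    total = 1
    for degree in layout.values():
        total *= degree
    return total


def experts_per_gpu(num_experts, ep_degree):
    """Whole experts placed on each GPU under expert parallelism."""
    if num_experts % ep_degree:
        raise ValueError("experts must divide evenly across GPUs")
    return num_experts // ep_degree


dense = parse_layout("TP8PP2")
dense_gpus = gpu_count(dense)             # 8-way TP inside each of 2 pipeline stages
moe_gpus = gpu_count(parse_layout("EP16"))
per_gpu = experts_per_gpu(128, 8)         # the EP8 example from the list above
```

Both multi-node layouts use 16 GPUs; what differs is which traffic crosses the slow inter-node link.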
Disaggregation
This is the technique the whole roofline discussion was building toward. Prefill is compute-bound, decode is memory-bound, and under load they fight each other for the same GPU's resources. So split them onto separate engines/GPUs. Three steps: prefill builds the KV cache and the first token → ships the KV over the interconnect → decode does the rest. Conditional disaggregation is the smart default: a request hits the decode engine first, and if it's short or already cached it's handled locally, otherwise it's shipped to a prefill engine.
When is it worth the complexity? High volume (~100M–1B tokens/day), large models (~100B+ params), and prefill-heavy/long-input workloads — code editors with huge context are the textbook case. Dynamic disaggregation (NVIDIA Dynamo) makes it adaptive: a prefill queue, routing decisions based on input-length-after-prefix and queue depth, and NIXL-based KV transfer that transposes layout across differing TP configs. The notation is xPyD — 5P3D means 5 prefill workers, 3 decode workers, and you can retune the ratio at runtime as traffic shifts.
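A sketch of the two mechanical pieces here: reading the xPyD notation, and a conditional routing rule driven by uncached input length and prefill queue depth. The thresholds, the function names, and the "fall back to local when the queue is full" policy are illustrative assumptions on my part, not Dynamo's actual configuration.

```python
import re


def parse_xpyd(spec):
    """Turn a string like '5P3D' into worker counts."""
    match = re.fullmatch(r"(\d+)P(\d+)D", spec)
    if not match:
        raise ValueError(f"unrecognized xPyD spec: {spec!r}")
    return {"prefill": int(match.group(1)), "decode": int(match.group(2))}


def route_request(input_tokens, cached_prefix_tokens, prefill_queue_depth,
                  local_prefill_limit=512, max_queue_depth=8):
    """Decide where a request's prefill runs (thresholds are hypothetical)."""
    uncached = max(0, input_tokens - cached_prefix_tokens)
    if uncached <= local_prefill_limit:
        return "decode-local"      # short or mostly cached: not worth shipping KV
    if prefill_queue_depth >= max_queue_depth:
        return "decode-local"      # prefill pool is backed up: absorb it here
    return "remote-prefill"


workers = parse_xpyd("5P3D")
short = route_request(200, 0, prefill_queue_depth=0)
cached = route_request(8000, 7900, prefill_queue_depth=0)
long_cold = route_request(8000, 0, prefill_queue_depth=2)
backed_up = route_request(8000, 0, prefill_queue_depth=8)
```

The key input is length after the cached prefix, not raw prompt length: an 8,000-token prompt with 7,900 tokens already cached is a short request as far as prefill is concerned.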
It's not free: you invent new bottlenecks. The prefill queue can back up, and the decode side can exhaust its KV memory — which loops back to KV-cache quantization and offloading as the fixes. Symbiotic techniques, exactly as promised.
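To see why KV-cache quantization is the natural fix for decode-side memory exhaustion, it helps to do the per-token arithmetic. A minimal sketch using the standard size formula; the model dimensions (80 layers, 8 KV heads, head dim 128) and the 20 GB KV budget are illustrative assumptions for a 70B-class model with grouped-query attention.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV cache bytes stored per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value   # 2 = K and V


def tokens_that_fit(kv_budget_gb, per_token_bytes):
    """How many cached tokens a KV memory budget can hold across all requests."""
    return int(kv_budget_gb * 1e9 // per_token_bytes)


fp16_per_token = kv_bytes_per_token(80, 8, 128, bytes_per_value=2)
fp8_per_token = kv_bytes_per_token(80, 8, 128, bytes_per_value=1)

fp16_capacity = tokens_that_fit(20, fp16_per_token)
fp8_capacity = tokens_that_fit(20, fp8_per_token)
```

Halving the bytes per value doubles how many tokens the decode workers can hold, which translates directly into more concurrent requests or longer contexts before the pool runs dry.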
Drills — Chapter 5
Why does quantization help compute-bound prefill and memory-bound decode through two different mechanisms?
Why must speculative decoding be disabled at high batch sizes?
You move user-specific text to the front of every prompt. What did you just break?
Dense 70B over two nodes: why TP8PP2 rather than TP16?
When is disaggregation the wrong call, and what new failure modes does it add when it's right?
Modalities: LLM tricks, generalized
Good news: you already learned most of this. Nearly every modality is one of two shapes, and one of those shapes is just "an LLM wearing a hat."
Two archetypes
Almost everything is either autoregressive token generation (LLMs, VLMs, speech-to-text, text-to-speech) or iterative denoising (image and video); embedding models are the exception, a single forward pass with no decode loop. If it's autoregressive, basically every engine and technique from Chapters 2 and 5 carries over — KV cache, batching, quantization, speculation, the works. What changes per modality is mostly the metric: for TTS you care about time-to-first-word or first-sentence, not first token.
| Modality | Input → output | Archetype |
|---|---|---|
| LLM | text → text | autoregressive |
| VLM | image+text → text | autoregressive |
| Embeddings | text → vector | single forward pass |
| ASR | audio → text | autoregressive (enc-dec) |
| TTS | text → audio | autoregressive |
| Image / video | text → pixels | iterative denoising |
VLMs
A vision-language model is an LLM with a small vision encoder bolted on the front. The size asymmetry is wild: Mistral Large 3 pairs a ~2B vision encoder with a ~673B LLM. The encoder is tiny but fragmented — every VLM family does it slightly differently — which is exactly why flexible engines like vLLM and SGLang matter; they absorb that fragmentation so you don't hand-roll it.
The real cost is tokens. A high-res image becomes ~1000 visual tokens fed into the LLM's context. So the VLM challenge isn't new math, it's longer inputs and bigger KV caches — and every Chapter 5 technique applies directly. The new dial is downsampling: trading image resolution for token count (hi-res can be ~4× the tokens), which becomes a serious quality/speed knob the moment you do multi-image or video. Omni-modal models that take everything in and emit multiple modalities are the natural extension.
Embedding models
The odd one out, and refreshingly simple. Encoder-only, bidirectional attention (no causal mask — it can see the whole input at once), one forward pass, no decode loop. Output is a single vector. Because there's no autoregression, serving is purely throughput-oriented: batch hard. Matryoshka embeddings let you truncate the vector to a shorter dimension with graceful quality loss, so one model serves multiple size/quality points.
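Using a Matryoshka embedding at a shorter dimension is just "keep the first k values, then renormalize". A minimal sketch with a made-up 4-dimensional vector; it only gives graceful quality loss on models trained with a Matryoshka objective, and the function names are mine.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first dim components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


full = [3.0, 4.0, 1.0, 2.0]
short = truncate_embedding(full, 2)
```

The renormalization matters if your index assumes unit vectors (dot product as cosine). Queries and documents must be truncated to the same dimension before comparing.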
ASR and TTS
ASR (Whisper-style encoder-decoder) splits by use case: real-time/short clips care about latency, while long-file transcription is a pipeline problem — chunk the audio, use voice-activity detection (VAD) to find speech, overlap chunks so you don't cut words. Diarization ("who spoke when") is a separate layer on top.
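The overlap part of the long-file pipeline is worth seeing in code. A small sketch that only computes chunk boundaries; the 30-second chunk and 5-second overlap are example values, and VAD-guided boundaries or stitching the overlapping transcripts back together are out of scope.

```python
def chunk_windows(total_s, chunk_s, overlap_s):
    """(start, end) windows covering [0, total_s], each sharing overlap_s with the next."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    step = chunk_s - overlap_s
    windows = []
    start = 0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


windows = chunk_windows(total_s=70, chunk_s=30, overlap_s=5)
```

Each window is independent, so the chunks can be transcribed as a batch. That independence is what turns long-file ASR into a throughput problem instead of a latency one.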
TTS flips it: tokens → audio, with the vocabulary expanded to include waveform/acoustic tokens. Streaming matters (time-to-first-audio is the metric), and speech-to-speech models close the loop for real-time conversation.
Image and video generation
Image gen leans on kernel optimization plus the "one weird trick" family — caching or skipping denoising steps when consecutive steps barely change. Video is where it gets brutal: attention is the dominant cost because the latent spans X, Y, and T, so you throw quantization and attention optimization at it, and you add context parallelism to split that enormous latent across GPUs. It's the one place the LLM toolkit needs genuinely new moves.
Drills — Chapter 6
Why does serving a VLM feel like serving a long-context LLM?
Why can embedding models be batched far more aggressively than LLMs?
What new parallelism does video generation need that LLM serving doesn't, and why?
Production: runtime is necessary, not sufficient
A perfect runtime that takes four minutes to cold-start, can't survive a node death, and can't scale with traffic is a science project. This chapter is everything between "the model is fast" and "the service is real."
Containers and dependencies
You ship in Docker, and you start from the official vLLM or SGLang base image rather than building the CUDA stack yourself. The thing that will actually hurt you is dependency management. The chain is long and brittle: GPU driver + CUDA + cuDNN, then torch + transformers + diffusers, then the engine, then system packages like ffmpeg — and any version slipping out of alignment breaks the whole thing in a way that's miserable to debug. Pin everything (uv or poetry). For day-zero support of a just-released model you'll run nightly or pre-release builds, then rebuild on the stable release once it lands.
NIMs (NVIDIA Inference Microservices) are prepackaged, model-specific containers — great as a base, a reference, or an out-of-the-box option, but you build your own when you want maximum control over the config.
Autoscaling
Production runs on Kubernetes: a control plane that decides and a worker plane of instances running your replicas. The subtle part is what signal you scale on. Utilization is lagging (by the time GPUs are pegged you're already late); traffic is proactive (scale on incoming requests before they hurt). They diverge — a handful of huge-prefill requests can pin utilization without much request volume, and a flood of tiny requests can spike traffic without much utilization. You watch both. Five knobs do the tuning: min replicas, max replicas, autoscaling window, scale-down delay, and concurrency target. Concurrency and batch sizing sit underneath all of it.
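Here's how the five knobs might fit together in a traffic-based decision. This is a simplified sketch of my own, not any particular autoscaler's algorithm: it scales up immediately, scales down only after load has stayed low for the delay, and all names and numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AutoscaleConfig:
    min_replicas: int
    max_replicas: int
    window_s: int             # how long the caller collects samples before deciding
    scale_down_delay_s: int   # how long load must stay low before shrinking
    concurrency_target: int   # in-flight requests one replica should carry


def decide_replicas(current, in_flight_samples, low_load_for_s, cfg):
    """Replica count to run next, from in-flight request counts over the window."""
    avg_in_flight = sum(in_flight_samples) / len(in_flight_samples)
    wanted = math.ceil(avg_in_flight / cfg.concurrency_target)
    wanted = max(cfg.min_replicas, min(cfg.max_replicas, wanted))
    if wanted > current:
        return wanted                                   # scale up right away
    if wanted < current and low_load_for_s >= cfg.scale_down_delay_s:
        return wanted                                   # scale down only after the delay
    return current


cfg = AutoscaleConfig(min_replicas=1, max_replicas=10, window_s=60,
                      scale_down_delay_s=300, concurrency_target=8)
scale_up = decide_replicas(2, [40, 44, 36], low_load_for_s=0, cfg=cfg)
hold = decide_replicas(5, [4, 4, 4], low_load_for_s=30, cfg=cfg)
scale_down = decide_replicas(5, [4, 4, 4], low_load_for_s=600, cfg=cfg)
```

The asymmetry is deliberate. Scaling up late drops requests, while scaling down early means paying for a cold start minutes later, so the delay only applies in one direction.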
Cold starts — the one that gets everyone
This is the hardest production problem, because a "scale up" isn't instant. Three segments, each slow for its own reason:
- Node provisioning — getting a GPU instance from the cloud. Cloud-dependent and genuinely negotiable in your contract.
- Loading weights — pulling hundreds of GB into VRAM. Make it smaller (quantization) or make the pipe fatter: load from a same-datacenter cache, not Hugging Face or S3 across the internet, where egress and distance kill you.
- Engine start — vLLM/SGLang start fast; TensorRT-LLM and anything with torch.compile take minutes, so cache the built engines and make sure the cache matches the exact GPU/CUDA/deps or it's silently invalid.
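The three segments add up, which makes a cold-start budget easy to sketch. Every number below (provisioning time, bandwidths, engine start) is a hypothetical placeholder; the point is only to show how the two weight-loading levers act independently.

```python
def cold_start_seconds(provision_s, weights_gb, load_gb_per_s, engine_start_s):
    """Total cold start: provision the node, load the weights, start the engine."""
    return provision_s + weights_gb / load_gb_per_s + engine_start_s


# Hypothetical baseline: 140 GB of FP16 weights over a 1 GB/s remote pull.
baseline = cold_start_seconds(60, weights_gb=140, load_gb_per_s=1.0, engine_start_s=30)

# Lever 1: quantize to halve the bytes that have to move.
smaller = cold_start_seconds(60, weights_gb=70, load_gb_per_s=1.0, engine_start_s=30)

# Lever 2: pull from a same-datacenter cache with a fatter pipe.
fatter = cold_start_seconds(60, weights_gb=140, load_gb_per_s=10.0, engine_start_s=30)

# Both levers together.
both = cold_start_seconds(60, weights_gb=70, load_gb_per_s=10.0, engine_start_s=30)
```

Once both levers are pulled, weight loading stops being the dominant term and the fixed costs of provisioning and engine start take over, which tells you where to optimize next.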
Routing, load balancing, queueing. A router answers "where should this go" (it has opinions — send to the replica holding this request's KV prefix, or the one with the right LoRA loaded); a load balancer answers "where could this go" (spread the load). Queues are FIFO or priority, and you hand a freshly-started replica queued work carefully so it doesn't get stampeded before it's warm.
Scale-to-zero drops to no replicas when idle. It needs fast cold starts and a robust queue, and it's great for dev, bursty, or business-hours workloads. But needing it for latency-sensitive light traffic is a smell — if a trickle of requests must each be fast, a hosted API is the better answer than eating a cold start on every one.
For multi-model pipelines (VAD → ASR → LLM), scale each component independently — they have different load profiles. But keep them in one cluster: intra-cluster hops are ~10 ms, cross-cluster ~50 ms, and for a real-time voice pipeline that 40 ms difference, multiplied across stages, quietly eats your entire latency SLA.
Multi-cloud capacity management
Siloed multi-cloud — "we run separate stacks on AWS and CoreWeave" — is easy to set up and bad. True multi-cloud treats every GPU pool across regions and providers as fungible capacity behind one bin-packer with a global view that self-heals around failures. The supporting pieces: GPU procurement strategy (reserved vs on-demand vs spot, and the contracts behind them), geo-aware load balancing (run near users for latency), redundancy across regions/clouds for reliability, and security/compliance (air-gapped deployments, data residency rules).
Deployment, observability, clients
Ship with zero-downtime rolling deploys. Do real cost estimation — per-request, per-user, per-month unit economics, the numbers that decide whether the product survives. And invest in observability: GPU metrics, token metrics, distributed traces, and crucially the inference-only-vs-end-to-end split from Chapter 1, so when something's slow you instantly know whether it's the model or the plumbing. On the client side, mind client-side overhead, use async inference where you can, and support streaming protocols (SSE, WebSockets, gRPC) so the fast TTFT you worked for actually reaches the user.
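The unit-economics arithmetic is short enough to write down. A sketch with hypothetical prices and throughput; it assumes the replica is kept busy at the stated total tokens per second, so real costs at partial utilization will be higher.

```python
def cost_per_million_tokens(gpu_hourly_usd, gpus_per_replica, replica_tps):
    """Dollars per one million generated tokens for a fully utilized replica."""
    replica_hourly = gpu_hourly_usd * gpus_per_replica
    tokens_per_hour = replica_tps * 3600
    return replica_hourly / tokens_per_hour * 1_000_000


def monthly_cost(gpu_hourly_usd, gpus_per_replica, replicas, hours=730):
    """Dollars per month for always-on replicas (about 730 hours in a month)."""
    return gpu_hourly_usd * gpus_per_replica * replicas * hours


# Hypothetical: $3 per GPU-hour, 8 GPUs per replica, 12,000 total tokens/sec.
per_million = cost_per_million_tokens(3.0, 8, 12_000)
per_month = monthly_cost(3.0, 8, replicas=2)
```

Use total (system) TPS here, not per-user TPS. Mixing the two up is the fastest way to get a cost estimate that's off by the batch size.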
Drills — Chapter 7
You autoscale purely on GPU utilization and still drop requests during spikes. Why, and what do you add?
Cold starts are dominated by weight loading. Two independent levers?
A team wants scale-to-zero for a low-traffic, latency-critical endpoint. Push back?
Why keep a VAD → ASR → LLM pipeline in a single cluster even though you scale each part separately?
How I'd actually approach a new model
If I had to compress the whole book into a decision tree I'd run top to bottom, gating each step on the traffic and model-size thresholds from Chapter 5, it's this.
- Do you even need dedicated inference? Low or bursty traffic, no special latency/cost/availability constraint → use a hosted API and stop here. Don't pay GPU-hours to serve a trickle.
- Pin the spec. Interface, latency budget, unit economics, traffic shape. The more constraints, the better the result. Pick the smallest model that passes your evals; fine-tune before you reach for a bigger one.
- Pick an engine. vLLM/SGLang to stand up fast and for new or fragmented architectures; TensorRT-LLM later when the model's stable and throughput is king; Dynamo once you're at distributed scale.
- Quantize and measure. Drop precision a level, run real evals, confirm the quality delta is noise. Free-ish speed on both prefill and decode.
- Add caching and speculation. Prefix caching if requests share long prefixes (system prompts, agents, RAG). Speculative decoding (EAGLE) if you're TPS-bound at low-to-moderate batch — and remember to disable it once batches get big.
- Parallelize only when you must. Won't fit one GPU, or need lower latency / bigger KV? TP within a node by default; EP for MoE; PP only to bridge nodes.
- Disaggregate at scale. ~100M–1B tokens/day, ~100B+ params, prefill-heavy inputs → split prefill and decode (xPyD), then manage the new prefill-queue and decode-KV bottlenecks.
Every step is gated. You don't disaggregate a 7B model serving 1k requests/day any more than you call a hosted API to serve a billion tokens. Match the technique to the traffic, measure everything, change one variable at a time.
Glossary
- TTFT
- Time to first token. Gated by compute-bound prefill; the metric a streaming UI lives on.
- ITL
- Inter-token latency. 10 ms ITL = 100 tokens/sec for one user.
- TPS
- Tokens per second — say whether you mean perceived (per-user) or total (system).
- Prefill
- Processing the whole input in parallel and building the KV cache. Compute-bound.
- Decode
- The autoregressive loop, one token per forward pass. Memory-bound.
- KV cache
- Stored keys/values of past tokens, so each decode step does \(O(N)\) attention work over the cache instead of recomputing the full \(O(N^2)\) pass.
- ops:byte
- A chip's compute-to-bandwidth ratio; the break-even arithmetic intensity (~295 on an H100 FP16).
- Arithmetic intensity
- An algorithm's compute ÷ memory traffic. Below the chip's ops:byte → memory-bound; above → compute-bound.
- Roofline
- The chart of bandwidth ceiling + compute ceiling meeting at the ridge point.
- FlashAttention
- Fused, tiled attention that never spills the \(N\times N\) matrix to HBM. Lossless, hardware-specific.
- PagedAttention
- KV cache stored in non-contiguous pages via a lookup table; kills fragmentation. (vLLM.)
- MoE
- Mixture of experts; a router activates a few experts per token, so total params ≫ active params.
- MLA
- Multi-head latent attention; compresses keys and values into a small latent vector, shrinking the KV cache.
- Mamba / SSM
- Recurrent state-space alternative to attention; hybrids interleave it with transformer layers.
- Quantization
- Lowering numeric precision (FP16 → FP8/FP4) for speed; risk is compounding precision error.
- MX / microscaling
- Number formats attaching shared scales to small blocks for better low-precision dynamic range.
- Speculative decoding
- Draft ahead cheaply, validate in parallel, accept the valid prefix + 1. Helps TPS, not TTFT.
- EAGLE
- Draft module that ingests the target's hidden states; sub-1B, high acceptance, fused (no CPU round-trip).
- Acceptance rate
- Fraction of drafted tokens the target keeps; decays with draft depth.
- Prefix caching
- Reusing KV across requests that share a leading prefix. Put variable content last.
- TP / EP / PP
- Tensor (split tensors, intra-node default), expert (shard MoE experts, multi-node), pipeline (split layers, bubbles) parallelism.
- all-reduce
- The per-layer sync tensor parallelism needs; why TP wants fast NVLink inside one node.
- Disaggregation
- Running prefill and decode on separate pools because their bottlenecks conflict.
- xPyD
- Disaggregation ratio notation — 5P3D = 5 prefill workers, 3 decode workers, retunable at runtime.
- NIXL
- The KV-transfer mechanism (in Dynamo) that ships and re-lays-out KV between prefill and decode.
- MIG
- Multi-instance GPU; slice one GPU into isolated hardware partitions for small models.
- SM
- Streaming multiprocessor; the GPU's compute block, holding CUDA/Tensor/SFU cores.
- Tensor core
- The matrix-multiply-accumulate unit; where inference FLOPs are measured.
- HBM
- High-bandwidth memory (VRAM); the off-chip pool whose bandwidth caps decode.
- NVLink / InfiniBand
- Fast intra-node and slower inter-node GPU interconnects; both well below VRAM bandwidth.
- NIM
- NVIDIA Inference Microservice; a prepackaged model-specific serving container.
- Engine
- vLLM / SGLang / TensorRT-LLM — the layer that turns weights into a fast served model.
- Cold start
- Time from "scale up" to "serving": provision + load weights + start engine.
- Scale-to-zero
- Dropping to no replicas when idle; needs fast cold starts; bad for latency-critical light traffic.
- Percentiles (P50/P90/P99)
- Tail latency measures; the mean lies because latency is right-skewed.
- Distillation
- Training a small student on a teacher's full probability distribution, not just its outputs.
- Diffusion
- Image/video generation by iterative denoising from noise in latent space. Compute-bound.
- Context parallelism
- Splitting a huge (often video) latent across GPUs.
These notes are built from Philip Kiely's Inference Engineering (Baseten Books, 2026) — the book I worked through to learn this. The explanations here are my own; any errors are too.