7  Model serving

7.1 Preview

In this chapter, we move from model training to model serving. We will examine how to deliver predictions in online mode (low latency) and batch mode (high throughput), while balancing speed, quality, and cost.

We will cover three sets of decisions:

  • model-level decisions, such as model choice, graph optimization, and reduced precision or quantization
  • inference-system decisions, such as batching, queueing behavior, and scaling with replicas or GPUs
  • infrastructure decisions, such as placement (edge vs cloud) and hardware tier

7.2 Balancing fast, good, and cheap inference

In model serving, we usually care about three outcomes at the same time: speed, quality, and cost. We can improve any one of them, but we rarely improve all three together without tradeoffs.

First, we should define what “fast” means based on inference mode. In online inference, fast means low latency (for example, p95 response time). In batch inference, fast means high throughput (for example, records per second or jobs completed per hour).

Next, “good” usually means model quality on the task metric we actually care about, such as accuracy, recall, or mean absolute error.

“Cheap” means infrastructure cost, especially the cost of compute instances used for inference.

Now we can make the tradeoff concrete:

  • In an online recommendation API, we may replace a small ranker with a larger model that improves NDCG. Quality goes up, but the larger model may increase p95 latency and require more GPU replicas to stay within the latency target. In that case, we improved “good,” but either “fast” or “cheap” (or both) gets worse.
  • In a nightly fraud scoring batch job, we may switch to larger instances and wider parallelism to finish before 6:00 AM. Throughput goes up and the pipeline finishes on time, but infrastructure cost rises. In that case, we improved “fast” for batch mode, while “cheap” got worse.
  • In a cost-reduction push, we may apply aggressive quantization and run on cheaper hardware. Cost can drop substantially, but we may observe a measurable decrease in quality. In that case, we improved “cheap,” but “good” became harder to maintain.

So, instead of trying to maximize one metric in isolation, we should set explicit targets for all three:

  • a speed target (latency target for online, throughput target for batch),
  • a quality floor (minimum acceptable task metric),
  • and a cost budget (per request, per batch run, or per month).

Then we evaluate candidate designs against those targets. A model that is slightly less accurate but much faster and cheaper may create more total product value than a model that is marginally better but slower or more expensive.

To manage the tradeoff between “fast”, “good”, and “cheap”, we have three major elements we can control: the model artifact itself, the system from which we serve it, and the infrastructure that we deploy the system in. In the rest of this section, we will discuss specific techniques for balancing this tradeoff in the context of each of these three elements.

7.3 Model

7.3.1 Model choice

In practice, the model itself is often the single biggest driver of serving behavior. A model that needs 10x more compute per prediction will usually stay expensive, even if we tune everything else perfectly.

For many tasks today, we start by choosing among pretrained or foundation models. This choice strongly affects all three parts of our tradeoff triangle:

  • “Fast”: larger models usually take longer to run each prediction and need more memory movement.
  • “Good”: larger models may improve task quality, especially on harder or long-tail examples.
  • “Cheap”: larger models often require more powerful instances, more replicas, or both.

As a concrete example, imagine a text classification API that currently uses a small logistic regression model on CPU instances. If we switch to a large foundation model which we fine-tune for our task, we may gain a few points of F1 score, but we may also miss latency targets unless we move to GPU instances. That quality gain can be worth it in some products, but we should make the cost and latency consequences explicit.

The same pattern appears in other models. In gradient-boosted trees or random forests, increasing the number of trees or tree depth can improve accuracy, but it also increases inference work per request.

For neural networks, the main capacity knobs have similar effects. More layers, wider hidden dimensions, longer sequence length, larger embedding tables, and more complex feature interactions all increase compute and memory pressure at serving time. We often get better quality from these changes, but the cost is usually higher p95 latency (online) or lower throughput (batch), plus higher infrastructure spend.

So, in model serving, we will usually compare model variants as complete tradeoff packages instead of looking at quality alone. For each candidate model, we should evaluate the model in isolation (i.e., not yet integrated into an overall serving system), and record:

  • task quality metric (for example, accuracy, recall, F1, mean absolute error),
  • serving speed metric (p50 and p95 latency for online inference, throughput for batch),
  • and cost metric (for example, cost per 1k predictions, or per batch job).

(Note that the latter two can be evaluated without actually training the model; serving speed and cost are the same for a model with random weights as for a model with tuned weights.)

This gives us a practical selection rule: we can identify the candidate models that are feasible in terms of serving speed and cost, and then select the one with the best task quality.
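This selection rule can be sketched in a few lines of C. The `Candidate` struct, the function name, and the numbers are our own illustration, not a prescribed API: filter out candidates that violate the latency target or cost budget, then keep the best quality among the rest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate record: task quality (higher is better),
 * online p95 latency in ms, and cost per 1k predictions. */
typedef struct {
    const char *name;
    double quality;      /* e.g., F1 */
    double p95_ms;       /* serving speed */
    double cost_per_1k;  /* infrastructure cost */
} Candidate;

/* Discard candidates that miss the latency target or exceed the cost
 * budget, then pick the best task quality among the feasible ones.
 * Returns an index into cands, or -1 if nothing is feasible. */
int select_model(const Candidate *cands, size_t n,
                 double latency_target_ms, double cost_budget_per_1k) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (cands[i].p95_ms > latency_target_ms) continue;
        if (cands[i].cost_per_1k > cost_budget_per_1k) continue;
        if (best < 0 || cands[i].quality > cands[best].quality)
            best = (int)i;
    }
    return best;
}
```

Note that with tight targets this rule can return “nothing is feasible,” which is itself useful information: it tells us the targets, not the model, need to change.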

7.3.2 Compiling and optimizing the graph

When we are developing and training machine learning models, we usually run a model in eager mode. We call the model, each operator runs immediately, and we get outputs. This mode is used in development because it is easier to use and debug: we can inspect tensors step by step, use normal Python control flow, and quickly test changes.

But for serving, we often switch to graph mode. In graph mode, the framework captures the computation structure (operators, tensor shapes, dependencies, and parameters) as a graph artifact.

flowchart LR
    E("**Eager mode**<br/><i>framework code + model weights</i>")
    G("**Graph mode**<br/><i>compiled computation graph</i>")
    O("**Optimized graph**<br/><i>fusion, layout, kernel selection</i>")
    R("**Runtime execution plan**<br/><i>execution provider schedules graph</i>")

    E --> G
    G --> O
    O --> R

    classDef box fill:#e5e7eb,stroke:#222,stroke-width:1px,color:#111;
    class E,G,O,R box;

This compile step, where we create the graph representation, can happen ahead of time (AOT) or just in time (JIT). In the JIT case, the first request may take much longer, but subsequent requests should be faster.

Once the model is compiled into a graph, the runtime does not need to execute exactly the same operator sequence in the same order as eager mode; it only needs to produce a result that is computationally equivalent to the graph definition. For example, the runtime may fuse many operations into an equivalent kernel. Here is an example of a graph fusion from NVIDIA’s technical blog:

Original graph

%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 22, 'padding': 0}}}%%
flowchart TD
    C[" concat "] --> N[" next input "]
    V1c[" 1x1 conv "] --> V1b[" bias "] --> V1r[" relu "] --> C
    V2ac[" 3x3 conv "] --> V2ab[" bias "] --> V2ar[" relu "] --> C
    V2bc[" 1x1 conv "] --> V2bb[" bias "] --> V2br[" relu "] --> V2ac
    V3ac[" 5x5 conv "] --> V3ab[" bias "] --> V3ar[" relu "] --> C
    V3bc[" 1x1 conv "] --> V3bb[" bias "] --> V3br[" relu "] --> V3ac
    V4ac[" 1x1 conv "] --> V4ab[" bias "] --> V4ar[" relu "] --> C
    V4b[" max pool "] --> V4ac
    IC[" concat<br/>input "] --> V1c
    IC --> V2bc
    IC --> V3bc
    IC --> V4b

    classDef op fill:#78b900,stroke:#4a7a00,stroke-width:1px,color:#ffffff;
    class N,C,IC,V4b,V1c,V1b,V1r,V2ac,V2ab,V2ar,V2bc,V2bb,V2br,V3ac,V3ab,V3ar,V3bc,V3bb,V3br,V4ac,V4ab,V4ar op;

Vertical fusion

%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 60, 'padding': 0}}}%%
flowchart TD
    C["concat"] --> N["next input"]
    V1["1x1 CBR"] --> C
    V2a["3x3 CBR"] --> C
    V2b["1x1 CBR"] --> V2a
    V3a["5x5 CBR"] --> C
    V3b["1x1 CBR"] --> V3a
    V4a["1x1 CBR"] --> C
    V4b["max pool"] --> V4a
    IC["concat<br/>input"] --> V1
    IC --> V2b
    IC --> V3b
    IC --> V4b

    classDef op fill:#78b900,stroke:#4a7a00,stroke-width:1px,color:#ffffff;
    class N,C,V1,V2a,V2b,V3a,V3b,V4a,V4b,IC op;

Figure: Original graph, and the optimized version with vertical fusion.

To make the original graph more efficient, an optimizer scans each branch from input (on the bottom) toward next input (on the top) and looks for contiguous operator chains that can be legally merged. In this example, each conv -> bias -> relu chain becomes one fused CBR node, because the sequence of three operations can be scheduled as one composite kernel stage. This is called vertical fusion: sequential blocks are fused.

Vertical fusion

%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 60, 'padding': 0}}}%%
flowchart TD
    C["concat"] --> N["next input"]
    V1["1x1 CBR"] --> C
    V2a["3x3 CBR"] --> C
    V2b["1x1 CBR"] --> V2a
    V3a["5x5 CBR"] --> C
    V3b["1x1 CBR"] --> V3a
    V4a["1x1 CBR"] --> C
    V4b["max pool"] --> V4a
    IC["concat<br/>input"] --> V1
    IC --> V2b
    IC --> V3b
    IC --> V4b

    classDef op fill:#78b900,stroke:#4a7a00,stroke-width:1px,color:#ffffff;
    class N,C,V1,V2a,V2b,V3a,V3b,V4a,V4b,IC op;

Horizontal fusion

%%{init: {'flowchart': {'nodeSpacing': 35, 'rankSpacing': 35, 'padding': 0}}}%%
flowchart TD
    C["concat"] --> N["next input"]
    F1["fused 1x1 CBR<br/>disaggregate"] --> C
    V2a["3x3 CBR"] --> C
    F1 --> V2a
    V3a["5x5 CBR"] --> C
    F1 --> V3a
    V4a["1x1 CBR"] --> C
    V4b["max pool"] --> V4a
    IC["concat<br/>input"] --> F1
    IC --> V4b

    classDef op fill:#78b900,stroke:#4a7a00,stroke-width:1px,color:#ffffff;
    class N,C,F1,V2a,V3a,V4a,V4b,IC op;

Figure: Graph with vertical fusion, then with addition of horizontal fusion.

Then, the optimizer can also look for opportunities to fuse parallel blocks, i.e., horizontal fusion. In this example, three “sibling” blocks perform the same 1x1 CBR on the same input (albeit with different filter weights), so they can be replaced with one fused block that computes all of them at once. After that fused computation, the output channels are split so each original branch still receives the portion it expects.

Although the math of the optimized graph is equivalent, the execution plan is different. In the unfused graph, we launch separate kernels for convolution, bias add, and ReLU, and each stage writes and rereads intermediate values from global memory. This is bad, because access to GPU global memory is slow.

GPU memory hierarchy levels. [Image source](https://cvw.cac.cornell.edu/gpu-architecture/gpu-memory/memory_levels)

Memory level Approximate access cost
Registers ~1 cycle
Shared memory ~10s of cycles
Global memory ~100s of cycles

Figure: Diagram of a GPU memory hierarchy, from registers to shared memory to global memory.

However, during kernel execution, we can use much faster on-chip memory: registers and shared memory. In the fused graph, where multiple operations are combined in a single fused kernel, those intermediate values are kept in registers or shared on-chip memory as much as possible, and we write out only the final fused result for that stage.

The table below summarizes some of the basic optimizations that can be applied to a graph of a neural network:

Optimization

Before

After

What changes and why it helps


Constant Folding

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  A["Const A"] --> B["Add"]
  C["Const B"] --> B
  B --> Y["Output"]

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  K["Const C (precomputed)"] --> Y["Output"]

If a subgraph depends only on constants, it is computed once during optimization. Inference skips those runtime ops.


Identity Elimination

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> I["Identity"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> Y

Removes pass-through nodes that do not change values, reducing operator count and runtime overhead.


Slice Elimination

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> S["Slice (full range)"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> Y

Deletes no-op slices that return the original tensor range.


Unsqueeze Elimination

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> U["Unsqueeze"] --> Q["Squeeze"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> Y

Removes temporary shape ops that cancel each other out.


Dropout Elimination

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> D["Dropout"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> Y

Drops training-only regularization ops in inference graphs.


Conv Add Fusion

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> C["Conv"] --> A["Add (bias)"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> F["Conv + bias"] --> Y

Fuses post-conv bias add into conv parameters, removing one op and one intermediate tensor write/read.


Conv Mul Fusion

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> C["Conv"] --> M["Mul (scale)"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> F["Scaled Conv"] --> Y

Absorbs post-conv scaling into conv weights.


Conv BatchNorm Fusion

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> C["Conv"] --> B["BatchNorm"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> F["Conv (fused W,b)"] --> Y

Folds BatchNorm parameters into conv weights and bias for inference-time execution.


Relu Clip Fusion

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> R["Relu"] --> C["Clip"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> F["Clipped Relu"] --> Y

Merges consecutive activation constraints into one stage.


Reshape Fusion

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> R1["Reshape"] --> O["Op"] --> R2["Reshape"] --> Y

%%{init: {'themeVariables': {'fontSize': '9px'}}}%%
flowchart TD
  X --> F["Fused Op"] --> Y

Folds reshape-only transitions into neighboring ops when semantics stay equivalent.

Some of these optimizations fuse operations, and save on round-trip memory access as discussed above. Other optimizations remove no-ops. These nodes exist in the first place because inference graphs are exported artifacts from training frameworks, and that export process keeps many useful pieces of structure: constants for things like normalization statistics, shape-manipulation steps such as Unsqueeze, Slice, and Reshape, and sometimes pass-through nodes like Identity that preserve names or graph wiring. There may also be operators like Dropout that don’t apply during inference. Graph optimization removes this extra structure while keeping the same numerical result.

Although we have mostly discussed this in the context of neural networks, compilers also exist for other computation-intensive machine learning models, e.g. gradient-boosted trees.

7.3.3 Hardware-specific execution providers

Beyond generic graph optimizations, we can often get dramatic improvements by applying optimizations that are specific to the exact hardware that we will use for inference. An execution provider takes a model graph and runs it with its own hardware-specific kernels. The graph semantics stay the same, but the low-level implementation can change significantly across providers. An execution provider also builds an execution plan for memory placement, operation scheduling, and data movement, all of which strongly affect end-to-end latency and throughput.

To understand what an execution provider does, we should map the full AI compute stack: the set of layers that sit between a machine learning model and the physical hardware that runs it. At the top are frameworks like PyTorch, where developers write models using high-level operations such as matrix multiplication or convolution. These frameworks call into math libraries (such as cuBLAS or oneDNN) through a programming interface. Beneath those sits the runtime layer, which handles launching work on the compute device and moving data around. The runtime talks to the device driver, which manages communication with the device and schedules work on it. Finally, at the bottom is the hardware device itself.

Layer NVIDIA GPU (CUDA) AMD GPU (ROCm) CPU
Layer 5: Frameworks PyTorch, TensorFlow, JAX PyTorch, TensorFlow, JAX PyTorch, TensorFlow, JAX
Layer 4: Math libraries e.g. cuBLAS e.g. rocBLAS e.g. oneDNN / MKL
Layer 3: Programming interface CUDA C++ HIP C++ / OpenMP
Layer 2: System runtime CUDA Runtime API ROCr OS runtime + thread scheduler
Layer 1: Kernel driver NVIDIA Driver AMDGPU Driver OS kernel + CPU driver
Layer 0: Hardware e.g. H100 e.g. Instinct MI300X e.g. Xeon / EPYC

When people talk about “lowering” an operation, they mean translating a high-level framework operation down through the layers of the compute stack until it becomes concrete instructions that hardware can execute. Frameworks define operations in a hardware-agnostic way - for example, a matrix multiplication in PyTorch. When a hardware-specific execution provider is selected, those generic operations are mapped to implementations optimized for that backend. For example, with a CUDA execution provider, a matrix multiplication might be lowered into a call to cuBLAS through CUDA, whereas a CPU execution provider might lower the same operation into a vectorized CPU kernel.

Matrix multiplication is a good example of how the same high-level operation can be implemented differently depending on the hardware. Even though the math is identical, the way that data is accessed in memory can significantly impact performance. Consider a naive matrix multiplication:

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        C[i][j] = 0;
        for (int k = 0; k < K; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

where i iterates over rows of A, j iterates over columns of B, and k performs the dot product between a row of A and a column of B. Each output element is

\(C[i,j] = \sum_k A[i,k] \, B[k,j]\)

Notice that the naive implementation reads the entry \(A[i,k]\) many times while computing different elements of the output matrix \(C\). Each output \(C[i,j]\) is the dot product of row \(i\) of \(A\) and column \(j\) of \(B\), so the inner loop repeatedly accesses \(A[i,k]\) and \(B[k,j]\).

To see this visually, consider the situation shown below. Suppose a thread is assigned to compute a single output element \(C[i,j]\) marked in purple. To do this, the thread runs the inner loop over \(k\), loading \(A[i,k]\) and \(B[k,j]\), multiplying them, and accumulating the result in the purple square.

Naive matrix multiplication access pattern where one output element uses one row from A and one column from B.

The issue is that nearby threads are often performing almost the same work. For example, threads computing \(C[i,0]\), \(C[i,1]\), \(C[i,2]\), and so on all need the same sequence of values \(A[i,k]\) as they iterate through \(k\). In the naive implementation, however, each thread independently loads those values from global memory. As a result, the same element \(A[i,k]\) may be fetched many times by different threads, leading to unnecessary memory traffic and reduced performance.

If each matrix is \(16 \times 16\), then the 16 threads computing the row \(C[i,0], C[i,1], \dots, C[i,15]\) all need the same row \(A[i,*]\). In the naive implementation, each thread loads the entries \(A[i,k]\) independently from global memory, so each value \(A[i,k]\) is fetched 16 separate times while forming that row of \(C\).
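We can verify that claimed redundancy with a small counter (illustrative code we wrote, not a real profiler): simulate the naive kernel and tally how many times each element of \(A\) would be read from global memory.

```c
#include <assert.h>

#define SZ 16

/* Count reads of each A element when the naive kernel computes the full
 * SZ x SZ output: the "thread" for C[i][j] reads A[i][k] for every k. */
void count_a_loads(int loads[SZ][SZ]) {
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++)   /* one thread per output element */
            for (int k = 0; k < SZ; k++)
                loads[i][k]++;         /* thread (i,j) fetches A[i][k]  */
}
```

Running this shows every entry of \(A\) is fetched exactly 16 times, once per thread in its row, which matches the count in the text.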

If we are careful about how we organize the work to take advantage of the GPU architecture, we can do much better. To understand how, we should understand how a GPU executes code. First, some definitions:

  • thread: one logical worker with its own thread index (threadIdx), registers, program counter state, and predicate mask. In a matrix multiply kernel, each thread usually owns one output element (or a small tile of elements) and updates a private accumulator in registers before writing its final value to global memory.
  • warp/wavefront: the hardware scheduling unit inside a compute core. Threads in one warp/wavefront execute the same instruction stream in lockstep lanes, so control-flow differences are handled by masking lanes on each branch path. Typical sizes are 32 lanes on NVIDIA (warp) and 64 lanes on AMD (wavefront).
  • thread block: a programmer-defined group of threads launched together on one multiprocessor/compute unit, with shared memory and barrier synchronization scope limited to that block (__syncthreads/equivalent). Block shape (blockDim.x/y/z) determines occupancy, shared-memory pressure, and how many warps/wavefronts are created from that block.
  • grid: the complete 1D/2D/3D collection of thread blocks for one kernel launch (gridDim.x/y/z). Grid dimensions determine total parallel work, while each block computes its global coordinates from (blockIdx, threadIdx, blockDim) to map onto tensor indices and avoid overlap with other blocks.

Next, note that we can decompose the matrices into smaller tiles, so that, for example, \(C_{00} = A_{00} B_{00} + A_{01} B_{10}\).

Matrix multiplication decomposition into tiles and block products.

Now suppose that a group of \(8 \times 8\) threads in a thread block—with access to a common shared memory—collaborates as follows. First, the threads cooperate to load the tiles \(A_{00}\) and \(B_{00}\) from global memory into shared memory, with each thread loading one element of each tile. They then synchronize to ensure both tiles are fully loaded. Once this is done, each thread can compute one entry of the block \(C_{00}\) using only values stored in shared memory.

Thread block tile operation: cooperative load to shared memory, synchronize, then compute partial output tile.

This produces a partial result for \(C_{00}\), which each thread keeps in a register. The block then loads the next tiles \(A_{01}\) and \(B_{10}\) and repeats the process, accumulating into the same output entries so that eventually \(C_{00} = A_{00}B_{00} + A_{01}B_{10}\). Finally, each thread writes its result back to global memory.

Here is the flow again, with the memory access shown explicitly:

First step:

First tiled iteration showing global-to-shared loads and partial accumulation.

Second step:

Second tiled iteration showing next tile loads and accumulated output update.

Note that tile \(A_{00}\), for example, contributes to two output tiles: \(C_{00}\) and \(C_{01}\). A different thread block may therefore load \(A_{00}\) from global memory again when computing \(C_{01}\), so the tile may indeed be fetched more than once overall. However, this is still far better than in the naive implementation. There, each thread computing an element of \(C\) independently loads the entries \(A[i,k]\) from global memory, so the same value may be fetched many times by different threads. With tiling, each entry of \(A_{00}\) is loaded once per block and then reused by many threads from shared memory.

Here is what this implementation might look like:

#define T 8  // tile size (blockDim.x = blockDim.y = T)

__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[T][T];
    __shared__ float Bs[T][T];

    int row = blockIdx.y * T + threadIdx.y;
    int col = blockIdx.x * T + threadIdx.x;
    float sum = 0.0f;

    for (int kk = 0; kk < K; kk += T) {
        // Cooperative load: one A element and one B element per thread.
        if (row < M && kk + threadIdx.x < K)
            As[threadIdx.y][threadIdx.x] = A[row * K + (kk + threadIdx.x)];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;

        if (kk + threadIdx.y < K && col < N)
            Bs[threadIdx.y][threadIdx.x] = B[(kk + threadIdx.y) * N + col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        for (int k = 0; k < T; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}

This kernel matches the tiled pattern we described. Each block owns one \(T \times T\) output tile, and each thread computes one output entry (row, col) with a running sum in a register. The kk loop steps through the shared \(k\) axis in tiles. On each step, threads cooperatively load one tile of A and one tile of B into shared memory (As, Bs), synchronize, compute T multiply-accumulate operations from shared memory, then synchronize again before reusing those buffers. The bounds checks handle edge tiles when dimensions are not multiples of T. After all kk tiles are processed, each thread writes one final value to C.

The point is reuse: values loaded from global memory are reused many times from shared memory before we move to the next tile. This gives us a better ratio of computation to memory access.
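We can quantify this reuse. With square \(T \times T\) tiles, there are \((M/T)(N/T)\) thread blocks, each performing \(K/T\) tile steps and loading \(2T^2\) elements per step, so global-memory loads fall from the naive count to

```latex
\underbrace{2MNK}_{\text{naive: two loads per multiply-add}}
\quad\longrightarrow\quad
\frac{M}{T}\cdot\frac{N}{T}\cdot\frac{K}{T}\cdot 2T^{2}
\;=\; \frac{2MNK}{T}
```

a factor-of-\(T\) reduction in traffic (ignoring the single write per output element and any cross-block re-fetches of shared tiles).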

The loop above is designed for memory access on a GPU, and it might be further optimized for a particular GPU family. For example, the tile size may be chosen to match the number of threads in a warp or the amount of available shared memory. If we were running a model on CPU, we would use tiling for a similar purpose - reducing memory traffic. But on a CPU, the hardware structure is different, so the implementation ends up looking different too.

CPU and GPU core organization comparison (Cornell Virtual Workshop).

On CPUs, the goal is mostly to take advantage of the cache hierarchy (from fastest to slowest: L1, L2, L3). Unlike the GPU version, the programmer or compiler doesn’t explicitly move data into cache for CPU execution; the hardware does that automatically. Instead, the code is written to access memory in a cache-friendly order.

In the naive implementation,

for (i)
  for (j)
    for (k)
      C[i,j] += A[i,k] * B[k,j]

each output element \(C[i,j]\) is computed by walking across row \(i\) of \(A\) and down column \(j\) of \(B\).

But when a CPU loads an entry from slow memory into its fast cache, it actually loads an entire cache line containing several nearby elements. This means:

  • When iterating over the row of \(A\), we incur a cache miss on the first entry, \(A[i,k]\). But the cache-line fetch (typically 64 bytes, i.e., 16 floats) also brings in the next entries \(A[i,k+1], A[i,k+2], \dots\), which is a good use of cache because the following iterations need exactly those values.
  • However, the accesses to \(B[k,j]\) move down a column of \(B\). Since matrices are usually stored in row-major order, successive elements \(B[k,j], B[k+1,j], B[k+2,j], \dots\) are far apart in memory. As a result, each access (each iteration of the inner loop) incurs a cache miss and has to load a new cache line from \(B\), even though only a single element from it is used. Furthermore, by the time we actually finish iterating down the column and move over to need \(B[k+1,j]\), it has probably been evicted from the cache already to make room for the other data we’ve used in the interim.

This leads to poor cache utilization and many unnecessary memory accesses.

Naive CPU matrix multiplication memory access pattern across row-major matrices.

With this in mind, even just reordering the loops so that the innermost loop moves over memory sequentially will help:

for (int i = 0; i < M; i++) {
    for (int k = 0; k < K; k++) {
        float a = A[i][k];           // loaded once, reused across the inner loop
        for (int j = 0; j < N; j++) {
            C[i][j] += a * B[k][j];  // C must be zero-initialized beforehand
        }
    }
}

Visually, it looks like this:

CPU matrix multiplication with loop reordering for cache-friendly access.

Now, for a fixed (i,k):

  • A[i][k] is loaded once (and will be reused across many multiplications)
  • B[k][j] walks across a row of B (contiguous memory access, cache friendly)
  • C[i][j] walks across a row of C (each iteration accumulates partial results)

Note that the inner loop accesses contiguous memory:

B[k][0], B[k][1], B[k][2], ...
C[i][0], C[i][1], C[i][2], ...

as it computes

C[i][0] += A[i][k] * B[k][0]
C[i][1] += A[i][k] * B[k][1]
C[i][2] += A[i][k] * B[k][2]
...

In the reordering example above, we reused one value, \(A[i,k]\), for 16 multiplications (the 16 iterations of the inner loop over our \(16 \times 16\) matrix). But we can do even better! The solution, again, is tiling. Suppose that instead of moving across a row of \(C\) at a time, we move across small blocks of \(C\):

First step:

CPU tiling step 1.

Second step:

CPU tiling step 2.

Third step:

CPU tiling step 3.

Fourth step:

CPU tiling step 4.

Now, the CPU repeatedly uses a few rows of \(A\) and \(B\) for many multiplications in a small block of \(C\). With loop reordering, a value such as \(A[i,k]\) is reused across the inner loop and participates in 16 multiplications (for a \(16 \times 16\) matrix). With tiling, an entire \(4 \times 4\) block of \(A\) and \(B\) remains in cache while 64 multiplications are performed, allowing many elements to be reused before they are evicted.

Compare this to the naive implementation: there, the algorithm walks down a column of \(B\). Since the matrix is stored in row-major order, each access to \(B[k,j]\) loads a different row of \(B\) into the cache. By the time the algorithm reaches the bottom rows of \(B\), the earlier rows have already been evicted from the cache, even though they will be needed again when computing the next column of \(C\). With tiling, the algorithm restricts the computation to a small block of \(B\) at a time, and because that block is small enough to fit in cache, its rows remain resident while many multiplications are performed, allowing the data to be reused before it is evicted.

Here is what the implementation might look like, combining the reordering trick and the tiling:

#define T 4

for (int ii = 0; ii < M; ii += T) {
  for (int jj = 0; jj < N; jj += T) {
    for (int kk = 0; kk < K; kk += T) {

      for (int i = ii; i < ii + T && i < M; i++) {
        for (int k = kk; k < kk + T && k < K; k++) {

          float a = A[i][k];

          for (int j = jj; j < jj + T && j < N; j++) {
            C[i][j] += a * B[k][j];
          }

        }
      }

    }
  }
}
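Because the tiled loop nest visits the same multiply-adds as the naive version, just in a different order, the two must produce the same matrix. A small harness (our own, with shapes deliberately chosen not to be multiples of the tile size, to exercise the edge conditions) can check that:

```c
#include <assert.h>
#include <math.h>

#define M 7
#define N 9
#define K 11
#define T 4

/* Reference: naive triple loop. */
void matmul_naive(const float A[M][K], const float B[K][N], float C[M][N]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0f;
            for (int k = 0; k < K; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Reordered + tiled version from the text; C is zeroed first because the
 * loop nest accumulates into it. */
void matmul_tiled(const float A[M][K], const float B[K][N], float C[M][N]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;
    for (int ii = 0; ii < M; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < K; kk += T)
                for (int i = ii; i < ii + T && i < M; i++)
                    for (int k = kk; k < kk + T && k < K; k++) {
                        float a = A[i][k];
                        for (int j = jj; j < jj + T && j < N; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

This kind of equivalence check is worth automating whenever we hand-optimize a kernel: the reordering is only legal because floating-point accumulation order changes introduce at most tiny rounding differences.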

Besides optimizing memory access, hardware-specific kernels also take advantage of specific compute patterns. For example, we mentioned earlier that within each thread block, the hardware organizes threads into warps for lockstep execution.

SIMT execution model (AMD HIP documentation).

The idea of “lockstep execution” relates to how a GPU executes code. The execution paradigm is known as SIMT (single instruction, multiple threads). In this model, threads are scheduled in small groups called warps (NVIDIA) or wavefronts (AMD). In each step, every thread in the group executes the same operation, each on a different piece of data:

Figure: SIMT execution of the operation a[i] = b[i] + c[i]: four threads apply the same add to their own data (thread 0 computes a[0] = 5 + 3 = 8, thread 1 computes a[1] = 2 + 4 = 6, and so on).

This means that if half the threads in one group take an if branch and the other half do not, the hardware executes one path while masking the other, then executes the other path sequentially. Some lanes are idle during each path, so throughput drops. This is called divergence. For that reason, GPU kernels are usually written so that threads in the same warp follow the same execution path whenever possible.
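A back-of-the-envelope model of this masking (a simulation we wrote for illustration, not real hardware behavior) makes the throughput cost concrete: when any lane takes a path, the whole warp issues that path's instructions, with non-matching lanes masked off.

```c
#include <assert.h>

#define LANES 4

/* Simulate SIMT execution of `if (cond) pathA else pathB` for one warp.
 * useful = lane-steps that did real work; issued = lane-steps the
 * hardware had to spend, including masked-off (idle) lanes. */
void simulate_branch(const int cond[LANES], int steps_a, int steps_b,
                     int *useful, int *issued) {
    int taken = 0;
    for (int l = 0; l < LANES; l++) taken += cond[l];
    *useful = taken * steps_a + (LANES - taken) * steps_b;
    /* If any lane takes a path, all LANES lanes issue that path's steps. */
    *issued = (taken > 0 ? LANES * steps_a : 0)
            + (taken < LANES ? LANES * steps_b : 0);
}
```

With half the lanes on each path of two equal-length branches, utilization drops to 50 percent, while a warp whose lanes all agree pays no penalty at all.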

Similarly, GPUs are most efficient when threads in a warp read contiguous memory locations, a pattern known as memory coalescing. For example, if thread 0 reads A[0], thread 1 reads A[1], thread 2 reads A[2], and so on, the hardware can combine those requests into a small number of memory transactions. But if each thread reads from widely separated addresses, the GPU may need many separate transactions, which increases latency and reduces bandwidth efficiency. This is another reason tiled kernels are structured carefully: they often arrange threads so that neighboring threads load neighboring elements from global memory before placing them into shared memory.

Figure: Perfectly coalesced global memory access in a warp (NVIDIA CUDA C Programming Guide).

Figure: Uncoalesced global memory access in a warp (NVIDIA CUDA C Programming Guide).

As a concrete byte-level example, assume one warp has 32 threads and each thread loads one 4-byte word. The warp needs 32 x 4 = 128 bytes of useful data. In the coalesced case, consecutive threads request consecutive words, so hardware can fetch the required 128 bytes in four 32-byte transactions. That means 128 bytes transferred and 128 bytes used, so effective memory utilization is 100 percent. In the worst uncoalesced case, neighboring threads access addresses at least 32 bytes apart. Then each thread can force a separate 32-byte transaction, for 32 x 32 = 1024 bytes transferred total. But the warp still only uses 128 bytes of payload (4 bytes per thread), so effective utilization drops to 128/1024 = 12.5 percent. This is an 8x increase in traffic for the same arithmetic work.
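To make this arithmetic explicit, here is a small sketch in pure Python (not from any GPU toolkit; `coalescing_utilization` is a hypothetical helper) that counts how many 32-byte transactions a warp-wide load touches for a given address stride:

```python
def coalescing_utilization(num_threads=32, word_bytes=4, stride_bytes=4,
                           transaction_bytes=32):
    # Each thread loads one word; consecutive threads' addresses are
    # stride_bytes apart. Hardware fetches whole transactions, so we count
    # how many distinct transaction-sized segments the warp touches.
    touched = {(i * stride_bytes) // transaction_bytes
               for i in range(num_threads)}
    transferred = len(touched) * transaction_bytes
    useful = num_threads * word_bytes
    return useful / transferred

print(coalescing_utilization(stride_bytes=4))   # coalesced: 1.0
print(coalescing_utilization(stride_bytes=32))  # worst case: 0.125
```

The two calls reproduce the 100 percent and 12.5 percent figures from the example above.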

These hardware considerations show why kernels differ between platforms. Even when the underlying algorithm is the same, the details of how threads are organized, how memory is accessed, and which low-level instructions are available can vary depending on the GPU architecture. For example, a CUDA kernel written for GPUs from NVIDIA might use warp-level primitives such as warp shuffle instructions or assume a warp size of 32 threads. A kernel targeting GPUs from AMD through ROCm might instead rely on wavefront operations and different scheduling assumptions, since AMD GPUs use wavefronts (often 64 threads) as their basic execution group. Different CPU and GPU architectures may also expose different instruction sets or specialized hardware instructions - for example tensor or matrix-multiply instructions - which kernels can use to accelerate certain operations. As a result, even when the high-level algorithm is the same, the compiled kernel may differ depending on which hardware architecture it targets.

7.3.4 Reduced precision and quantization

After graph and kernel optimization, the next big lever is numeric precision. We can often reduce serving latency and cost by storing tensors with fewer bits and by using lower-precision math. The tradeoff is that we introduce approximation error, so we have to check task quality carefully.

Note that this is distinct from reduced-precision training, which we discussed in a previous unit. Here, we are focused on post-training inference optimization: we take a trained model and transform how values are represented and executed at inference time. However, the key idea is the same as before: lower-precision arithmetic moves fewer bits per value, so memory bandwidth pressure drops, cache utilization improves, and the hardware can often execute more operations per cycle on specialized low-precision units. On modern GPUs, tensor cores are designed for high-throughput matrix math at low precision (for example FP16/BF16 and often INT8, sometimes INT4 paths depending on hardware/runtime), which is a major reason quantized and reduced-precision inference can be much faster.

This section discusses two related ideas: reduced precision and quantization.

In the context of training, we said that reduced precision (especially with specialized representations like bf16) is a good way to reduce memory pressure. But in training we are much more concerned with preserving floating-point dynamic range in order to capture stable gradient updates over many steps, so full integer quantization tends to be very fragile.

In the context of inference, reduced precision is still an easy first step: many models keep almost the same quality, while compute throughput and memory footprint improve. Quantization can be faster still, and because we are less concerned about numerical stability at inference time, it becomes much more appealing in this context.

At a high level, quantization maps floating-point values into low-bit integer space:

\[\begin{equation*} \texttt{val}\_\texttt{quant} = \texttt{round}\left(\frac{\texttt{val}\_\texttt{fp32}}{\texttt{scale}}\right) + \texttt{zero}\_\texttt{point} \end{equation*}\]

And dequantization maps them back to approximate floating-point values:

\[\begin{equation*} \texttt{val}\_\texttt{fp32} \approx \texttt{scale} \times (\texttt{val}\_\texttt{quant} - \texttt{zero}\_\texttt{point}) \end{equation*}\]

where scale is a positive real number used to map floating-point numbers into quantization space, and zero_point aligns integer zero with the floating-point range. Most implementations use round-to-nearest (often ties-to-even), but exact rounding behavior can vary by framework and hardware, e.g. floor/ceil are possible.

As a concrete example, if one weight value is 0.684, with scale = 0.00296 and zero_point = -59, then:

\[ \texttt{val}\_\texttt{quant} \approx \texttt{round}(0.684 / 0.00296) - 59 = 172 \]

The stored value is now an integer instead of FP32. That reduces model size and memory traffic, which is often a major serving bottleneck.
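The two formulas above can be sketched directly. The following minimal illustration uses the worked example's values; note that real kernels also clip the quantized result to the target integer type's range, which this sketch omits to match the example:

```python
def quantize(val_fp32, scale, zero_point):
    # val_quant = round(val_fp32 / scale) + zero_point
    # (real kernels also clip the result to the integer type's range)
    return round(val_fp32 / scale) + zero_point

def dequantize(val_quant, scale, zero_point):
    # val_fp32 ~ scale * (val_quant - zero_point)
    return scale * (val_quant - zero_point)

scale, zero_point = 0.00296, -59
q = quantize(0.684, scale, zero_point)
print(q)                                 # 172, matching the worked example
print(dequantize(q, scale, zero_point))  # ~0.684, with small rounding error
```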

At this point, the important practical question is: how do we choose scale and zero_point for each tensor? Most “types” of quantization are different answers to that question. Some methods choose one parameter set for an entire tensor, others choose one per channel. Some center the integer range around zero, others shift it with a nonzero offset. These design choices determine quantization error, runtime efficiency, and implementation complexity.

Furthermore, we may choose different strategies for weight quantization and activation quantization, which have separate quantization paths:

  • When we say we quantize weights, we mean we convert and store them in low-bit form ahead of inference. Since we have the weights in advance, we can quantize them in advance.
  • When we say we quantize activations, we mean activation values are quantized as they are produced during inference. We cannot quantize activations in advance because to compute them, we need the inputs.

Weights are fixed after training, while activations change for every request, so these two quantization paths are handled differently.

The first choice is mapping style. In symmetric quantization, we usually set zero_point to 0 (for signed int), and choose scale from the maximum absolute value. This keeps math simple and is often efficient for quantizing weights. The tradeoff is that if values are not centered around zero, part of the integer range is wasted.

In asymmetric quantization, we allow a nonzero zero_point, so the mapped integer range can shift to match skewed data. If the float range is not centered around zero, zero_point will usually not be equidistant from quantized min and quantized max. This is often useful for activations, which may be mostly positive or otherwise offset. The tradeoff is slightly more bookkeeping and sometimes less convenient kernel math.

The three pairings of data distribution and mapping style compare as follows:

  • Symmetric data + symmetric quantization: a clean match. Because values are distributed around zero, a zero-centered integer range uses its levels efficiently with simple math.
  • Symmetric data + asymmetric quantization: still works, but the nonzero offset buys little when the distribution is already centered; we add parameter complexity without a clear range-fit benefit.
  • Asymmetric data + asymmetric quantization: typically a better fit for shifted distributions (for example, mostly positive activations), because zero_point shifts the integer range toward where the values actually lie.
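The two parameter-selection rules can be sketched in a few lines (hypothetical helper names; signed int8 assumed for the symmetric case, unsigned 8-bit for the asymmetric case):

```python
def symmetric_params(values, qmax=127):
    # Symmetric: zero_point fixed at 0, scale from the max absolute value
    scale = max(abs(v) for v in values) / qmax
    return scale, 0

def asymmetric_params(values, qmin=0, qmax=255):
    # Asymmetric: map [min, max] onto [qmin, qmax] with a nonzero offset
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

# Mostly positive activations: symmetric wastes the negative half of the range
acts = [0.0, 0.5, 1.0, 2.0, 4.0]
print(symmetric_params(acts))   # scale = 4/127 ~ 0.0315, zero_point = 0
print(asymmetric_params(acts))  # tighter scale = 4/255 ~ 0.0157, zero_point = 0
```

For this skewed data, the asymmetric mapping halves the scale, so each quantization level covers a finer slice of the actual value range.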

The second choice is granularity: how many scale and zero-point values should we maintain?

A tensor is just a structured numeric array (for example, a matrix for a linear layer, or a 4D array for convolution features). A channel is one full feature map inside that tensor. For example, if an output has shape [1, 64, 56, 56], then it has 64 channels, each channel being a 56 x 56 map.

Figure: Tensor view: one tensor contains all channels for a layer output (for example shape [batch, channels, height, width]). This illustrates per-tensor quantization (one shared scale/zero_point for the full tensor).

Figure: Channel view: one channel is one feature-map slice from that tensor. Per-channel quantization gives each such slice its own scale (and sometimes zero_point).

In per-tensor quantization, one scale/zero_point pair is shared by all values in the tensor. This is simpler and fast, but one outlier can force a very large scale and reduce effective precision for typical values. In per-channel quantization, each channel gets its own scale (and sometimes zero point). This usually reduces error because each channel can use a tighter range. The common downside is extra metadata and slightly more complex execution planning.

A concrete intuition is weight matrices in linear layers. Different output channels often have different value ranges. This happens because each channel corresponds to a different learned detector (different weights, different feature patterns, different activation frequency). Some channels learn strong, high-magnitude responses, while others carry weaker or sparser signals. When one global scale must cover all of them, channels with smaller ranges get too few useful quantization levels, which increases relative error. With per-channel quantization, each channel gets a more appropriate scale, and reconstructed values are usually closer to FP32.

For many serving systems, a strong default is: per-channel quantization for weights, per-tensor quantization for activations, and then adjust only if quality or performance targets are missed.
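To see why a shared scale hurts small-range channels, here is a toy comparison (hypothetical helper, symmetric 8-bit quantization) of round-trip error under per-tensor versus per-channel scales:

```python
def quant_error(values, scale):
    # Max round-trip error under symmetric quantization at the given scale
    return max(abs(v - scale * round(v / scale)) for v in values)

# Two "channels" with very different ranges, as in a linear layer's weights
ch_small = [0.01, -0.02, 0.015]
ch_large = [1.5, -2.0, 1.0]

# Per-tensor: one scale must also cover the large-range channel
per_tensor_scale = max(abs(v) for v in ch_small + ch_large) / 127
# Per-channel: each channel gets its own, tighter scale
per_channel_scales = [max(abs(v) for v in ch) / 127 for ch in (ch_small, ch_large)]

print(quant_error(ch_small, per_tensor_scale))       # large error on small channel
print(quant_error(ch_small, per_channel_scales[0]))  # much smaller error
```

With the per-tensor scale, the small channel's values all land within one quantization step of zero, so relative error is severe; the per-channel scale restores fine resolution there.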

Next, we need to consider when activation scale and zero_point are chosen: offline during calibration in static quantization, or online at runtime in dynamic quantization.

In static quantization, we prepare both weight and activation quantization parameters before deployment. To compute the scale and zero points for activations, we run a calibration pass on representative data, record activation distributions per layer, and then bake resulting scales/zero points into the quantized graph. At serving time, the runtime no longer needs to estimate those ranges on each request, so static quantization often has lower latency and higher throughput than dynamic quantization. But there is a risk of calibration mismatch: if calibration data does not reflect production data, quality can degrade after launch.

In dynamic quantization, we quantize weights ahead of serving, but compute activation quantization parameters while requests are being processed. In other words, each runtime batch (or token step) observes activation ranges, computes scales/zero points, and then runs integer kernels. This makes dynamic quantization easy to apply because we do not need a calibration dataset before deployment. The tradeoff is extra runtime overhead from range collection and parameter computation, and usually less aggressive optimization than fully static pipelines.
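A minimal sketch of the dynamic path for activations (hypothetical helper; symmetric, 8-bit): the scale is derived from the batch actually being processed, which is exactly the per-request overhead described above:

```python
def dynamic_quantize_batch(activations, qmax=127):
    # Dynamic quantization: observe this batch's range, derive the scale on
    # the fly, then quantize. This work is repeated for every batch served.
    scale = max(abs(a) for a in activations) / qmax
    return scale, [round(a / scale) for a in activations]

scale, q = dynamic_quantize_batch([0.1, -0.4, 2.3, 0.0])
print(q)  # [6, -22, 127, 0]
```

In static quantization, by contrast, `scale` would have been computed once during calibration and baked into the graph.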

First convolution in a MobileNetV2 without quantization.

First convolution in a MobileNetV2 with static quantization. QLinearConv does integer convolution with 8-bit output. Note that x, w, and y each have separate scale/zero_point values, because input activations, weights, and output activations usually have different numeric ranges.



First convolution in a MobileNetV2 with dynamic quantization. ConvInteger does integer convolution with 32-bit output.

Figure: Illustration of the effect of post-training quantization on the graph of a neural network.

These three graphs show the same early MobileNetV2 computation expressed in different execution forms. In the non-quantized graph, weights and activations stay in floating point, and the convolution path uses floating-point operators. In the static-quantized graph, weights are already converted to low-bit form and activation quantization parameters are fixed from offline calibration, so the graph contains explicit quantized operators such as QLinearConv. In the dynamic-quantized graph, weights are also pre-quantized, but activation scales are computed from the current input at runtime. That is why the graph includes extra quantize/dequantize steps around ConvInteger: activations are quantized before convolution, integer convolution runs and accumulates in a wider integer type (often int32), then outputs are converted to the required output format. The figures do not show scale and zero-point values because in dynamic quantization those are runtime values, computed at inference time from the current activation tensor.

For some models, especially LLMs, weight-only quantization is most useful. During generation, serving is frequently memory-bandwidth bound, so shrinking weights can improve token throughput while preserving more quality than aggressive activation quantization.

To manage the tradeoff between speed/cost and quality, we can use accuracy-aware tuning for quantization. The idea is analogous to hyperparameter tuning, but for post-training quantization: instead of picking one quantization config once, we define a tuning space (precision choices, per-tensor vs per-channel choices, operator exclusions, calibration options), generate candidate quantized models, evaluate each one, and keep iterating until we meet an accuracy goal.

Figure: Accuracy-aware quantization tuning workflow in Intel Neural Compressor.

The process behaves like a closed loop around the quantization pipeline. The left side is model transformation (float model + algorithm config -> calibrate -> quantized model). The right side is search control (tuning config -> tuning space -> evaluate -> decision). If accuracy is below target, the tuner proposes a new config and tries again. If accuracy is acceptable, we stop and use the selected quantized model.
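The closed loop can be sketched as follows. All names here are hypothetical placeholders, not the Intel Neural Compressor API; the toy `evaluate_fn` simply pretends that 8-bit keeps more accuracy than 4-bit:

```python
def accuracy_aware_tune(float_model, candidate_configs, quantize_fn,
                        evaluate_fn, accuracy_floor):
    # Closed loop: model transformation (quantize) on one side, search
    # control (evaluate, decide, propose next config) on the other.
    for config in candidate_configs:
        candidate = quantize_fn(float_model, config)
        if evaluate_fn(candidate) >= accuracy_floor:
            return config, candidate   # accept the first config that passes
    return None, float_model           # no config met the floor; keep FP32

# Toy stand-ins: pretend 8-bit keeps accuracy 0.90 and 4-bit only 0.70
configs = [{"bits": 4}, {"bits": 8}]
best, _ = accuracy_aware_tune(
    "fp32-model", configs,
    quantize_fn=lambda model, cfg: (model, cfg["bits"]),
    evaluate_fn=lambda candidate: 0.90 if candidate[1] == 8 else 0.70,
    accuracy_floor=0.85,
)
print(best)  # {'bits': 8}
```

Real tuners order candidates more cleverly (for example, most aggressive first) and also weigh speed and memory, but the accept/reject loop is the same.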

When post-training quantization is not accurate enough, we can use quantization-aware training (QAT). The core idea is, we inject quantization behavior during training, so the model learns weights that are already robust to quantization.

In QAT, frameworks insert fake-quant behavior around quantizable operators (for example linear and convolution layers). During forward pass, values are quantized and dequantized in simulation:

\[ \texttt{x}\_\texttt{fake} = \texttt{dequant}(\texttt{quant}(\texttt{x})) \]

So training still runs with floating-point tensors, but those tensors are forced to behave like quantized tensors in forward computation. This exposes the model to clipping and rounding effects during optimization, which is exactly what we need for reliable deployment quality.

The backward pass is a little more complicated. Rounding is not differentiable, so QAT usually uses a straight-through estimator (STE) style gradient rule. A common form is:

\[ \frac{\partial \mathcal{L}}{\partial x} \approx \frac{\partial \mathcal{L}}{\partial \texttt{x}\_\texttt{fake}} \cdot \begin{cases} 1, & x_{\min} \le x \le x_{\max} \\ 0, & \text{otherwise} \end{cases} \]

This means we treat fake-quant as identity for values that are inside the representable quantization range, and we block gradients for values that are outside that range (which are clipped in forward pass anyway).

The STE gives us a useful training signal that is aligned with deployment behavior: forward pass includes real quantization effects (rounding and clipping), so the network learns to compensate for them, while backward pass remains trainable enough for SGD/Adam to update weights.
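Both passes can be illustrated on scalars (pure Python, hypothetical helper names; real frameworks implement this on tensors inside autograd):

```python
def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    # Forward: x_fake = dequant(quant(x)), including clipping to [qmin, qmax]
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))
    return scale * (q - zero_point)

def ste_grad(x, upstream_grad, x_min, x_max):
    # Backward (STE): identity inside the representable range, zero outside
    return upstream_grad if x_min <= x <= x_max else 0.0

print(fake_quant(0.684, scale=0.01, zero_point=0))              # ~0.68
print(ste_grad(0.5, upstream_grad=1.7, x_min=-1.0, x_max=1.0))  # 1.7
print(ste_grad(3.0, upstream_grad=1.7, x_min=-1.0, x_max=1.0))  # 0.0
```

The forward pass exposes the model to rounding (0.684 becomes 0.68) and clipping, while the backward pass lets gradients flow for in-range values.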

Compared with post-training quantization, QAT is more expensive because it requires additional training cycles and validation. But when we need aggressive low-bit deployment or strict quality targets, QAT is often one of the most reliable ways to recover accuracy.

7.3.5 Evaluating a model for inference

At this stage, we want to answer one narrow question: is this model artifact itself a good serving candidate, before any system-level effects are mixed in? So we evaluate model variants in a controlled harness on fixed hardware, with fixed runtime settings, and with identical inputs.

For variants that are not necessarily numerically equivalent (e.g. quantized models), we should record both quality and serving behavior. Quality means the task metric we care about (for example F1 for fraud detection, NDCG for ranking, top-1 accuracy for image classification), plus a few important slices. For example, if we test three ranking variants (fp16, int8 static, int4 weight-only), we should check not only overall NDCG but also NDCG for sparse-query traffic and new-user traffic, because quantization error often shows up first on tail slices.

On the serving side, model-only metrics should include p50 latency, throughput at fixed batch sizes, and peak memory use. A good pattern is to run each variant at batch sizes 1, 8, and 32 on the same GPU and report requests/sec and memory headroom for each.

We should always compare precision variants directly against a floating-point baseline, and report deltas in one place. For example: INT8 static: -0.4 F1, +1.9x throughput, -42% memory; INT4 weight-only: -1.7 F1, +2.6x throughput, -61% memory. This makes tradeoffs explicit.

It is important to make fair comparisons, which means we change one modeling variable at a time (model size, precision mode, or quantization method).

The output of this evaluation is not a final deployment decision. It is a shortlist: model variants that meet a quality floor and have acceptable model-only speed and memory. We then take those shortlisted variants into the next two sections, where we evaluate full inference-system behavior and infrastructure cost/risk.

7.4 Inference system

In the previous section, we treated the model as an artifact and evaluated it in isolation. But in production, that artifact lives inside an inference system with request handling, feature retrieval, batching, retries, timeouts, and deployment constraints. So the next step is to evaluate not only “is this model good enough,” but also “does the surrounding system deliver this model reliably, quickly, and cost-effectively under real traffic.”

7.4.1 Designing the prediction service

A good serving design starts with a clear boundary between application logic and inference logic. We expose a prediction endpoint with a stable request/response contract, then keep model execution concerns behind that boundary. This gives us a clean interface for teams and lets us evolve model internals without rewriting app routes.

First, define the API contract explicitly. That includes request schema, response schema, versioning strategy, and error behavior. For example, for an image classification endpoint we might accept one base64 image payload and return prediction, probability, model_version, and request_id. If input validation fails, we should return a clear 4xx response with a structured error body.

Second, separate responsibilities. The app-facing service should handle user/session context and application logic. The inference service should handle model loading, model-specific preprocessing, prediction execution, and model-specific postprocessing. This separation lets us scale app and inference tiers independently (for example CPU-heavy app workers and GPU-backed inference workers), and makes it easier to update components independently.

Third, design the boundary with operational details in mind: retry, idempotency, traceability. A practical pattern is to attach a request ID at the app layer, pass it through to inference, and include it in logs and responses. This way, we can trace a single request across app, feature retrieval, and model execution.

For example, here is a hypothetical inference endpoint implemented in FastAPI. Suppose we have a web app that accepts uploads and calls a separate inference endpoint at /predict. The web app does not load the model; it forwards payloads and renders responses. The inference service owns model loading and prediction execution.

A serving process usually has an initialization phase that runs once at startup (load model files, initialize runtime state, warm caches), and one or more prediction routes that handle HTTP or RPC requests.

First, let’s look at the startup behavior:

import base64
import io
import os

import torch
from fastapi import FastAPI, HTTPException
from PIL import Image
from pydantic import BaseModel
from torchvision import transforms

app = FastAPI()
model = None
model_version = "mobilenetv2-fp16-v5"

class PredictRequest(BaseModel):
    image: str  # base64-encoded image bytes

class PredictResponse(BaseModel):
    prediction: str
    probability: float
    model_version: str

@app.on_event("startup")
def load_model():
    global model
    model_path = os.getenv("MODEL_PATH", "/models/mobilenetv2-fp16-v5.pt")
    model = torch.jit.load(model_path, map_location="cpu")
    model.eval()

Here,

  • app = FastAPI() creates the serving application.
  • PredictRequest and PredictResponse define the API contract as Pydantic models: FastAPI uses them to parse and validate incoming JSON, validate and serialize response bodies, and generate OpenAPI schema/documentation automatically. This keeps request/response structure explicit.
  • @app.on_event("startup") is a Python decorator: it attaches the function below it to a framework lifecycle event. startup runs once when the service process starts and is appropriate for one-time initialization like loading model artifacts or opening database connections. The corresponding shutdown event is used for cleanup (closing files, flushing logs, releasing connections).
  • The startup hook load_model() runs once when the process starts, reads the model path from configuration, loads it into memory, and switches it to evaluation mode with model.eval(). (That eval() call is important because it disables training-time behaviors like dropout and uses inference-time behavior for modules like batch norm.)

Next, let’s look at the prediction route (per-request path):

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="model is not loaded")
    if not req.image:
        raise HTTPException(status_code=400, detail="image payload is required")

    image_bytes = base64.b64decode(req.image)
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    x = preprocess(image).unsqueeze(0)  # add batch dimension

    with torch.inference_mode():
        logits = model(x)[0]
        probs = torch.softmax(logits, dim=0)
        class_idx = int(torch.argmax(probs).item())
        probability = float(probs[class_idx].item())

    return PredictResponse(
        prediction=str(class_idx),
        probability=probability,
        model_version=model_version,
    )

Routes are the request-to-function mapping layer of the service. @app.post("/predict", response_model=PredictResponse) says: when an HTTP POST request hits /predict, call predict(...), parse the JSON body into PredictRequest, and validate/serialize the returned object as PredictResponse.

Because req is typed as PredictRequest, request validation happens before the function body runs. If a required field is missing (for example image), FastAPI returns 422 Unprocessable Entity automatically with a structured validation error.

Inside the route, we follow a predictable serving pipeline. First, fast guard checks fail early with clear status codes (503 when model is not ready, 400 for invalid payload). Next, we decode payload bytes and preprocess them into the input format expected by the model. Then inference runs inside torch.inference_mode() to disable autograd tracking and reduce overhead. Finally, we postprocess logits into user-facing fields and return a typed response that includes model_version, which is important for debugging.

This boundary design gives us flexibility for both online and batch modes. For online inference, we call /predict synchronously with strict timeouts. For batch inference, we can expose a separate batch endpoint or job interface.

7.4.2 Feature retrieval

In many systems, model compute is not the slowest part of the request path. The slower part is often fetching or computing the inputs the model needs.

We can think of the online path as:

  • request arrives
  • retrieve features
  • compute missing real-time features
  • run model
  • return response

If feature computation takes 40 ms and model inference takes 12 ms, improving the model alone will not fix end-to-end latency.

When we say feature retrieval, we mean runtime lookups such as “get user embedding”, “get recent click count”, or “get item popularity in last hour”. Each lookup adds network round-trip time, serialization/deserialization overhead, and possible contention in backing stores. A request that needs 8 features from 3 different systems usually has higher and more variable latency than a request that reads the same features from one colocated store.

We also may need online feature computation, such as “minutes since last session”, “distance from user to store”, or “rolling 10-minute click-through rate”. These features are computed after retrieval and before inference in the hot path. Their cost depends on how much state we need to fetch and how much arithmetic or aggregation we do per request.

For latency budgeting, it helps to separate the feature stage into three parts: lookup latency, compute latency, and wait time from queueing. For example, if p95 endpoint latency target is 120 ms, we might allocate 50 ms to retrieval and online feature compute, 40 ms to model inference, and 30 ms to network plus service overhead. This framing makes bottlenecks visible and tells us what to optimize first.
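A budget like this can be encoded as a simple check (hypothetical stage names and numbers, following the 120 ms example above):

```python
# Hypothetical stage budget for a 120 ms p95 endpoint target
budget_ms = {
    "feature_retrieval_and_compute": 50,
    "model_inference": 40,
    "network_and_overhead": 30,
}

def over_budget(measured_ms, budget_ms):
    # Return the stages whose measured latency exceeds their allocation
    return [stage for stage, limit in budget_ms.items()
            if measured_ms.get(stage, 0) > limit]

measured = {"feature_retrieval_and_compute": 62, "model_inference": 12,
            "network_and_overhead": 28}
print(over_budget(measured, budget_ms))  # ['feature_retrieval_and_compute']
```

Here the model is well within budget but the feature stage is not, so optimizing the model further would be wasted effort.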

7.4.3 Batching and dynamic batching

Batching is efficient because one model execution can process multiple inputs together. The intuition is like an elevator: carrying 8 people in one trip is more efficient than 8 separate trips. GPUs benefit from the same idea.

The challenge is that we do not control exactly when requests arrive. In most online systems, requests arrive one at a time from users, and they arrive irregularly. So, client-side pre-batching is often impractical for user-facing APIs.

We also usually do not want to enforce batching all the time by making every request wait in a queue until enough other requests arrive. During low-traffic periods when new requests arrive slowly, that would hurt latency. This is where dynamic batching helps. When the server is ready to grab the next request from the queue, it can grab one request or more, depending on how many are waiting and what its maximum batch size is.

The benefits of dynamic batching are most evident when arrivals are bursty. With constant interarrival interval, requests will not have to wait in a queue as long as the service rate is faster than the average request rate. Consider the 16 sequential requests below, with the blue interval illustrating the request transfer time and the purple interval, the compute time for inference:

Figure: Constant interarrival traffic (steady arrivals) and resulting queue behavior.

If arrivals are bursty, though, with some small and some long intervals between requests, then the exact same average request rate can lead to a lot of queuing. The time spent in a queue is illustrated with a red interval:

Figure: Bursty arrivals create queuing and queueing delay.

With dynamic batching, if the server can batch multiple requests together, there is much less queuing delay:

Figure: Dynamic batching under bursty traffic can improve throughput by combining queued requests.

We may also opt to add a slight delay between the end of a batch and the start of a new one, in case more requests arrive and can be added to the batch. This can add a small amount of delay to all requests, but potentially save large queuing delays for some requests.

Figure: Dynamic batching delay tradeoff: larger batches can improve efficiency but increase per-request waiting time.
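The scheduling step described above can be sketched with the standard library alone (`serve_one_batch` and `run_model` are hypothetical names; `run_model` stands in for one batched model execution): the server waits briefly for a first request, then drains whatever else is already queued, up to a maximum batch size:

```python
import queue

def serve_one_batch(request_queue, run_model, max_batch=8, max_wait_s=0.002):
    # Wait briefly for a first request, then drain whatever else is already
    # queued (up to max_batch) and run the model once on the whole batch.
    batch = []
    try:
        batch.append(request_queue.get(timeout=max_wait_s))
    except queue.Empty:
        return []  # nothing arrived during the wait window
    while len(batch) < max_batch:
        try:
            batch.append(request_queue.get_nowait())
        except queue.Empty:
            break
    return run_model(batch)  # one model execution for the whole batch

q = queue.Queue()
for i in range(5):  # a small burst of queued requests
    q.put(f"req-{i}")
results = serve_one_batch(q, run_model=lambda batch: [r.upper() for r in batch])
print(results)  # ['REQ-0', 'REQ-1', 'REQ-2', 'REQ-3', 'REQ-4']
```

Under the burst, all five queued requests ride in one model execution instead of five; under light traffic, the loop degenerates to batch size one with at most `max_wait_s` of added delay.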

7.4.4 Scale with replicas and GPUs

When batching is no longer enough, we increase service rate by adding replicas and GPUs. There are two common patterns: more model instances on the same GPU, or model instances spread across multiple GPUs.

More instances usually reduce queue delay because more requests can be processed in parallel. But there is a limit: instances on the same GPU contend for compute and memory bandwidth, so per-instance inference time can increase. The goal is not “max instances”, but the best point where queue delay drops faster than compute time rises.

Figure: As we add model instances per GPU, queue delay usually drops because more requests can run concurrently, but compute infer time often rises due to contention on the same device. Their intersection marks a candidate operating point, which we confirm with p95 latency, throughput, and cost.

7.4.5 Evaluating an inference system

At the inference-system level, we are no longer evaluating a model artifact in isolation. Instead, we want to evaluate the full request path under realistic load, and inspect latency breakdown to find bottlenecks. For online traffic, we might measure p50/p95/p99 latency, throughput, timeout/error rate, and queue delay. For batch traffic, we might track total completion time and throughput.

The most useful breakdown separates request latency into queue, compute input (parse/decode/preprocess), compute infer (model execution), compute output (postprocess/serialize), and framework overhead. This helps us answer a concrete question: are we waiting in line, or are we actually computing? We want to know the answer, because there are different solutions in each case (e.g. add more replicas vs. optimize the model).

Evaluation should include different traffic patterns: fixed low concurrency (baseline service time), higher concurrency (contention behavior), and constant versus Poisson arrivals (burst sensitivity).
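The constant-versus-Poisson comparison is easy to simulate with a single-server FIFO model (a toy sketch with hypothetical helper names, not a load-testing tool), and it shows why the same average rate can produce very different queueing delay:

```python
import random

def constant_arrivals(n, rate):
    # Evenly spaced arrivals at `rate` requests per second
    return [i / rate for i in range(n)]

def poisson_arrivals(n, rate, seed=0):
    # Exponential interarrival times with the same average rate (bursty)
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)
        times.append(t)
    return times

def queue_delays(arrivals, service_time):
    # Single-server FIFO: a request starts when it has arrived and the
    # server is free; its queueing delay is the time spent waiting.
    free_at, delays = 0.0, []
    for a in arrivals:
        start = max(a, free_at)
        delays.append(start - a)
        free_at = start + service_time
    return delays

# Same average rate (50 req/s) and service time (15 ms) in both cases
const_delay = max(queue_delays(constant_arrivals(100, rate=50), 0.015))
burst_delay = max(queue_delays(poisson_arrivals(100, rate=50), 0.015))
print(const_delay)  # 0.0: steady arrivals never queue here
print(burst_delay)  # > 0: bursts make some requests wait
```

Because the service rate exceeds the steady arrival rate, the constant case never queues, while the bursty case does even though its average rate is identical.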

7.5 Infrastructure

In the previous section, we focused on inference-system behavior. Now we look at the final big knob: infrastructure. This is where we decide where the system runs, what hardware it runs on, and how much capacity we keep online. Those choices largely determine where we land on the fast-cheap-good tradeoff.

7.5.1 Placement

Placement is usually the first infrastructure decision. A practical way to frame it is: where do we want to pay latency, and where do we want to pay cost and operations overhead?

One option is on-device inference. For example, keyboard autocomplete, camera enhancement, or wake-word detection often run on a phone. This eliminates network latency because the model runs where the input is produced, and also makes the feature available offline. It often improves responsiveness and can simplify privacy requirements, but we give up hardware flexibility and usually need smaller models that fit edge memory, storage, and power limits.

Another option is a single centralized cloud deployment: for example, one GPU-backed inference service in one region for a document extraction API. This is simpler to operate and can use stronger hardware efficiently, but every request pays round-trip network latency to that region.

A middle ground is regional deployments. For example, run the same inference stack in us-east, eu-west, and ap-southeast, then route each user to the nearest region. This improves network latency versus one central region, but increases cost and operations complexity because we run more replicas, replicate model artifacts and supporting data, and manage multi-region rollouts.

Data residency requirements can force this choice. If we cannot move raw user data across regions, we need region-local serving stacks even when a central deployment would be cheaper.

7.5.2 Hardware selection and constraints

Next, we choose hardware. The simplest comparison is usually CPU vs low-end GPU vs high-end GPU for the same service path and traffic profile.

Small models with light per-request compute can be cost-effective on CPU. Larger models, or models that benefit from batching, often need GPU to hit latency or throughput targets.

Constraints matter as much as raw speed: GPU memory limits batch size and concurrent model instances, host RAM limits preprocessing buffers, and interconnect bandwidth affects handoff costs. Good sizing starts from measured traces plus headroom for bursts.
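As a rough illustration of memory-driven sizing, the sketch below estimates the largest batch that fits on a device after loading weights and reserving headroom. All numbers, including the per-item activation cost, are assumptions; real sizing should come from measured traces:

```python
def max_batch_size(gpu_mem_gb, weights_gb, per_item_activation_gb, headroom=0.2):
    """Largest batch that fits after loading weights and reserving a
    fraction of device memory as headroom for bursts."""
    usable_gb = gpu_mem_gb * (1 - headroom) - weights_gb
    if usable_gb <= 0:
        return 0  # model weights alone exceed the memory budget
    return int(usable_gb // per_item_activation_gb)


# Illustrative: a 16 GB device, 6 GB of weights, ~0.5 GB of activations per item.
batch = max_batch_size(gpu_mem_gb=16, weights_gb=6, per_item_activation_gb=0.5)
```

A calculation like this gives a starting point; the measured trace then tells us whether that batch size also meets the latency target.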

7.5.3 Warm vs cold start behavior

In production, we usually scale up when traffic rises and scale down when traffic falls so we can meet demand without paying peak infrastructure cost all day.

But if we do so, we need to be aware that serving systems are not always warm. A cold start happens when a new replica loads artifacts and initializes runtime state. For large models, this can take seconds or minutes. If autoscaling reacts only after queues start to grow, queues can lengthen faster than new replicas become ready.

So, we need to quantify startup and treat it as part of the design: preload artifacts, use strict readiness checks, warm critical paths before serving traffic, and keep a minimum warm pool for online APIs.
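One way to encode this is to make warmup part of replica startup and gate readiness on it. Below is a minimal sketch; `load_model` and the warmup inputs are hypothetical placeholders for real artifact loading and representative requests:

```python
import time


class Replica:
    """A serving replica that reports ready only after artifacts are loaded
    and the critical path has been warmed."""

    def __init__(self, load_model, warmup_inputs, warmup_runs=3):
        self._load_model = load_model        # hypothetical artifact loader
        self._warmup_inputs = warmup_inputs  # representative requests
        self._warmup_runs = warmup_runs
        self._model = None
        self.ready = False
        self.startup_seconds = None          # quantified cold-start cost

    def start(self):
        t0 = time.perf_counter()
        self._model = self._load_model()     # cold start: load artifacts
        for _ in range(self._warmup_runs):   # warm the critical path
            for x in self._warmup_inputs:
                self._model(x)
        self.startup_seconds = time.perf_counter() - t0
        self.ready = True                    # readiness probe gate

    def predict(self, x):
        if not self.ready:
            raise RuntimeError("replica not ready")  # strict readiness check
        return self._model(x)
```

Logging `startup_seconds` per replica gives autoscaling a concrete number to plan around, instead of discovering cold-start cost during an incident.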

7.5.4 Utilization, cost, and capacity

Capacity planning is a utilization-risk tradeoff. Low utilization wastes money, but running near saturation creates unstable tail behavior.

A useful unit is cost per successful prediction at target SLO. A cheaper instance type can still be worse if timeout rate rises. Likewise, aggressive batching can raise throughput but still fail product latency goals.
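This metric is straightforward to compute once we log spend against within-SLO successes. A sketch with made-up prices, replica counts, and success rates:

```python
def cost_per_successful_prediction(hourly_cost, replicas, requests_per_hour, success_rate):
    """Infrastructure spend divided by predictions that met the SLO."""
    successful = requests_per_hour * success_rate
    if successful == 0:
        return float("inf")  # paying for capacity that serves nothing useful
    return (hourly_cost * replicas) / successful


# Hypothetical comparison: the cheaper instance type needs more replicas and
# still times out more often, so its within-SLO success rate is lower.
cheap = cost_per_successful_prediction(
    hourly_cost=0.50, replicas=4, requests_per_hour=100_000, success_rate=0.80)
fast = cost_per_successful_prediction(
    hourly_cost=1.20, replicas=2, requests_per_hour=100_000, success_rate=0.999)
```

With these illustrative numbers, the "cheap" setup costs more per successful prediction, because one in five of its requests misses the SLO.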

Capacity plans should be scenario-based, not single-point estimates: baseline, expected peak, and failure/burst cases. For each one, we should estimate queue delay, autoscaling response time, and cost.
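A back-of-the-envelope scenario plan might look like the sketch below; the rates, per-replica capacity, price, and utilization target are all illustrative:

```python
import math


def plan_capacity(rate_rps, per_replica_rps, hourly_cost, target_utilization=0.6):
    """Replicas (and cost) needed to serve rate_rps while keeping each
    replica at or below the target utilization, leaving burst headroom."""
    replicas = math.ceil(rate_rps / (per_replica_rps * target_utilization))
    return {"replicas": replicas, "hourly_cost": replicas * hourly_cost}


# Three scenarios instead of one point estimate (rates in requests/s).
scenarios = {"baseline": 200, "expected_peak": 800, "burst": 1600}
plans = {name: plan_capacity(rate, per_replica_rps=50, hourly_cost=1.10)
         for name, rate in scenarios.items()}
```

This covers only steady-state capacity; queue delay and autoscaling response time still need to come from measurement or a queueing model.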

7.5.5 Evaluating infrastructure for inference

At this level, we ask: can this platform sustain the target traffic with acceptable latency, cost, and risk?

We should evaluate with both controlled load tests and production-like traces, since steady synthetic traffic can hide cold-start and burst behavior. A practical scorecard includes p95/p99 latency, queue depth and wait, autoscaling reaction time, cold-start duration, utilization, memory headroom, error rate, and cost per successful request.

We should also test failure paths on purpose: reduce replica count, inject feature-store latency, or fail one region and observe behavior. Infrastructure decisions are less about fastest-case performance and more about predictable behavior under pressure.

7.6 Batch mode

So far we mostly discussed online serving, where each request waits for a response. In batch mode, we process a bounded dataset over a time window and optimize for completion, reliability, and cost. A batch job might score 50 million users overnight, generate recommendations every hour, or classify an archive of documents for compliance.

Unlike online systems, we can explicitly control data partitioning and batch size up front. That gives us more freedom to push throughput, but it also shifts responsibility to job orchestration and failure recovery.

If a batch of data fails and retries, rerunning it should not create duplicate outputs or inconsistent aggregates. A practical pattern is to write outputs with a deterministic key such as (run_date, partition_id, entity_id) and treat writes as upserts. We should also capture run metadata (model version, feature snapshot version, code version) so outputs are auditable and reproducible.
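A minimal sketch of this pattern using SQLite upserts (the table layout and column names are illustrative; production jobs would target a warehouse or key-value store with the same keying discipline):

```python
import sqlite3

# Table keyed on (run_date, partition_id, entity_id): a retried partition
# upserts the same rows instead of appending duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE scores (
        run_date      TEXT,
        partition_id  INTEGER,
        entity_id     TEXT,
        score         REAL,
        model_version TEXT,   -- run metadata for auditability
        PRIMARY KEY (run_date, partition_id, entity_id)
    )
""")


def write_scores(rows):
    """Idempotent batch write: retries overwrite rather than duplicate."""
    conn.executemany(
        """INSERT INTO scores VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (run_date, partition_id, entity_id)
           DO UPDATE SET score = excluded.score,
                         model_version = excluded.model_version""",
        rows,
    )
    conn.commit()


batch = [("2024-06-01", 7, "user-123", 0.91, "fraud-v3")]
write_scores(batch)
write_scores(batch)  # simulated retry: row count stays at 1
```

Because the output key is deterministic, a failed-and-retried partition converges to the same final table state as a clean run.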

Resource strategy in batch mode is different from online mode. We can use larger model batches, longer warmup, and more aggressive scheduling because per-item latency is not user-facing. But we still need to protect job deadlines. If a daily job must finish by 6:00 AM, we should plan capacity for worst-case input volume, not average volume.

For large runs, we usually combine horizontal data parallelism and model-level efficiency. For example, we might split 10 TB of inputs into 200 partitions, and run one worker per partition.
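A sketch of that split, using threads as stand-in workers; in practice each partition would go to a separate process or machine, and `score_partition` would run batched model inference:

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split inputs into roughly equal partitions (round-robin)."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def score_partition(part):
    """Hypothetical per-partition scorer: one worker handles one partition."""
    return [x * 2 for x in part]  # stand-in for batched model inference


def run_batch(items, num_partitions=4, max_workers=4):
    """Fan out partitions to workers, then gather partition outputs."""
    parts = partition(items, num_partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(score_partition, parts))
    return [y for part in results for y in part]
```

Because each partition is independent, a failed partition can be retried alone, which pairs naturally with the idempotent-write pattern above.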

7.7 Putting it together

A good final decision process is staged:

  1. Shortlist four or five candidates from model-only evaluation.
  2. Test full inference-system behavior under realistic traffic patterns.
  3. Test infrastructure behavior under stress and failure scenarios.
  4. Compare candidates against explicit targets (quality floor, latency/throughput target, and cost budget), and document operational risks.
  5. Pick one primary design and one fallback design that is simpler or more robust under failure.

7.8 Key terms

  • accuracy-aware tuning: An iterative search process that tries quantization settings until accuracy meets a target.
  • activation: An intermediate value produced inside a model while processing an input.
  • AOT: Compilation performed before runtime requests arrive.
  • asymmetric quantization: Quantization that uses a nonzero zero point so integer values can represent shifted ranges more accurately.
  • batch inference: Prediction mode where a dataset is processed in grouped jobs without user-facing per-request waiting.
  • channel: One feature-map slice within a tensor.
  • cold start: Extra startup delay before a new or restarted serving replica reaches normal response time.
  • dynamic batching: Runtime batching of queued requests using limits such as maximum batch size and queue delay.
  • dynamic quantization: Quantization where weights are pre-quantized, but activation ranges are computed during inference.
  • eager mode: Execution mode where operators run immediately as code executes.
  • execution provider: A runtime backend that maps graph operations to hardware-specific kernels.
  • feature retrieval: Request-time lookup of model inputs from online data systems.
  • graph mode: Execution mode where computation is represented as a graph artifact.
  • grid: The complete set of thread blocks launched for a kernel.
  • JIT: Compilation performed during runtime, often when concrete inputs are known.
  • kernel: A low-level hardware-executed routine for a specific operation.
  • latency: The time from request submission to response.
  • NDCG: A ranking metric that gives higher value to relevant items placed near the top.
  • online feature computation: Request-time computation of features from current context or recent events.
  • online inference: Prediction mode where each request is served while a caller waits for the response.
  • per-channel quantization: Quantization where each channel has its own quantization parameters.
  • per-tensor quantization: Quantization where one parameter set is shared across the full tensor.
  • quantization: Mapping floating-point values to lower-bit numeric representations.
  • quantization-aware training (QAT): Training that simulates quantization during forward passes so deployment-time quantization causes less quality loss.
  • reduced precision: Using lower-precision numeric formats to reduce compute and memory cost.
  • replica: An independent serving worker instance that processes requests in parallel with others.
  • SIMT: Execution model where many threads run the same instruction stream on different data.
  • static quantization: Quantization where activation ranges are calibrated before deployment.
  • straight-through estimator (STE): A gradient approximation used for non-differentiable quantization operations.
  • symmetric quantization: Quantization where integer range is centered around zero.
  • tensor: A multidimensional numeric array used in model computation.
  • thread: A lightweight execution unit that runs one instance of kernel code.
  • thread block: A group of threads that can synchronize and share on-chip memory.
  • throughput: Amount of inference work completed per unit time.
  • warp/wavefront: A lockstep scheduling group of threads in GPU execution.
  • weight-only quantization: Quantization strategy that compresses weights while keeping activations at higher precision.