6  Infrastructure and platforms for training

6.1 Preview

In the last unit, we introduced some techniques for training very large models. This week, we will discuss the infrastructure and platform requirements to support that type of large model training, as well as to support the training of many models by many teams.

Using OPT-175B as an example, we will identify some of the things that are different about training “at scale” vs. training toy models for practice and learning, and describe infrastructure and platforms that support the additional requirements of training at scale.

6.2 Training for 3 minutes vs. 33 days

Released in May 2022, OPT-175B was one of the first open weight and open source large language models at this scale, with the stated goal of democratizing research on large language models.1 But perhaps most importantly for our purposes, the entire process of training this model is documented in great detail in a 114-page OPT-175B training logbook,2 and in several presentations by its developers.

(You may also be interested in this shorter version of Susan Zhang’s presentation.)

This example highlights some differences between training models that take a few minutes to train, vs. training models that take days or weeks:

  • Testing training code: Some of the early hyperparameter tuning runs for OPT-175B were wasted because the team later discovered a bug in their training code, in the tensor parallelism implementation. This illustrates why we need testing before expensive runs: unit tests for core training logic, smoke tests for end-to-end wiring, and canary runs on real data with strict limits.
  • Monitoring: Live monitoring of model metrics (loss etc.) helped the team identify major problems in an initial “full scale” training run, so they could decide to terminate it and go in a different direction (switching out run 11 for run 12 lineage). Live monitoring of infrastructure made it possible to keep training going, by switching out failed nodes.
  • Experiment tracking: The team ran many experiments, and their “final” training run involved multiple restarts (stop and resume from checkpoint) with different settings (hyperparameters, etc.) in each. They tracked not only numeric run outputs, but also the intent behind each experiment in the logbook, so later decisions had context.
  • Hyperparameter tuning: For small models, we can often do broad sweeps. But with such a large model, the OPT-175B team did not have time for extensive hyperparameter tuning. Further complicating matters, the settings that worked well at smaller scale weren’t necessarily appropriate at larger scale.
  • Resource sizing and cost control: The OPT-175B team was “given” 1024 GPUs and 3 months, but the logbook makes clear this came with strong cost pressure and utilization pressure. There was a strong imperative to always be running something useful, not letting expensive nodes sit idle.
  • Scheduling and resource management: Although we do not always see every queue decision directly, the logbook shows continuous operational resource management work behind the scenes: updating node assignments, removing unhealthy nodes, and requeueing/resuming distributed runs.

These are relevant even if we are not pre-training a large language model from scratch. We will explore them in more detail in this chapter.

6.3 Testing training code

Some of the early hyperparameter tuning runs for OPT-175B were wasted because the team later discovered a bug in their training code, in the tensor parallelism implementation. This illustrates the need to systematically test model training code, just as we might test code that implements application logic.

If a training run takes a few minutes, it’s not a big deal if there is an error in training code; you can just re-run it. When training runs for days or weeks, a broken run is expensive and can even kill a project. Therefore, we should treat training code like production code and test it before we launch long jobs. The goal is not to prove the model is good, but rather to confirm that the training job is wired correctly and can survive the kinds of failures we know will happen.

We can see this at work in the code base that trained OPT-175B:

First, we should test the model and loss wiring. These checks should be automated and run every time the code changes, not done by a person scanning print statements. The goal of these tests is to catch common mistakes like mismatched tensor shapes, a loss that does not connect to the model outputs, or data not moved to GPU. We keep these tests small so they run in seconds and fail fast.

The general pattern for these unit tests is to:

  • set a deterministic seed (e.g. with torch.manual_seed),
  • create a deterministic input tensor,
  • run it through the model forward pass and, when needed, a backward pass,
  • and then check the expected behavior.

Sometimes we assert properties in isolation (for example, output shape and dtype, finite values, non-zero gradients). Other times we compare outputs or gradients to another path that should be equivalent. In PyTorch, we can implement this with torch.manual_seed (and related CUDA seed settings) for reproducibility and torch.testing.assert_close for robust tensor comparisons with explicit tolerances.
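
A minimal sketch of such a unit test, assuming a small PyTorch model; the TinyModel class and the tolerances here are illustrative, not from the OPT code base:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    def forward(self, x):
        return self.linear(x)

def test_forward_and_gradients():
    torch.manual_seed(0)                      # deterministic seed
    model = TinyModel()
    x = torch.randn(4, 8)                     # deterministic input tensor
    out = model(x)

    # Assert properties in isolation: shape, dtype, finite values.
    assert out.shape == (4, 2)
    assert out.dtype == torch.float32
    assert torch.isfinite(out).all()

    # Backward pass: the loss must actually connect to model outputs.
    loss = out.pow(2).mean()
    loss.backward()
    for name, p in model.named_parameters():
        assert p.grad is not None, f"no gradient for {name}"
        assert p.grad.abs().sum() > 0, f"zero gradient for {name}"

    # Compare against an equivalent path with explicit tolerances:
    # re-seeding gives an identical model, so outputs must match.
    torch.manual_seed(0)
    model2 = TinyModel()
    torch.testing.assert_close(model2(x), out, rtol=1e-5, atol=1e-6)

test_forward_and_gradients()
```

Because the whole test runs on a tiny model and a tiny batch, it finishes in well under a second and can run on every code change.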

We should also test checkpoint and resume. Since we are almost certainly going to need to stop and resume training, this test is very important. This means we start a run, save a checkpoint, stop the process, and then resume from that checkpoint in a brand-new process. We should verify that metrics continue as expected and that we do not accidentally reset optimizer state or data order. Also, if using distributed training with sharded state, we need to make sure that checkpoint logic first gathers all shards before saving a full checkpoint, and that all workers can restore consistently and continue from the same global step.
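
The checkpoint/resume test can be sketched as follows, assuming single-process PyTorch training; the tiny model, file name, and step counts are illustrative. For sharded state (e.g. FSDP), shards would need to be gathered or saved with a distributed-aware checkpoint API instead.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_training_state(seed=0):
    # Fresh model and optimizer, as if in a brand-new process.
    torch.manual_seed(seed)
    model = nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    return model, opt

def train_steps(model, opt, n_steps, data_seed=123):
    g = torch.Generator().manual_seed(data_seed)  # reproducible "data order"
    for _ in range(n_steps):
        x = torch.randn(8, 4, generator=g)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

def test_checkpoint_resume():
    # Process 1: train a few steps, then checkpoint everything we need.
    model, opt = make_training_state()
    train_steps(model, opt, 5)
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
    torch.save({"model": model.state_dict(),
                "optim": opt.state_dict(),
                "step": 5}, path)
    # (A real checkpoint also saves RNG state and dataloader position.)

    # Process 2: fresh objects with a different init, restored from disk.
    model2, opt2 = make_training_state(seed=999)
    ckpt = torch.load(path)
    model2.load_state_dict(ckpt["model"])
    opt2.load_state_dict(ckpt["optim"])

    # After restore, both copies must match exactly.
    for p1, p2 in zip(model.parameters(), model2.parameters()):
        torch.testing.assert_close(p1, p2)
    assert ckpt["step"] == 5

test_checkpoint_resume()
```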

Next, we will need an integration smoke run. A smoke run is a minimal job that touches data loading, training, logging, and checkpointing, on a very small number of samples. It is not meant to learn anything useful. It is meant to prove the pipeline works end to end before we schedule a long run.

Failure mode tests are another layer of safety, given that we will probably encounter some failures. We can simulate OOM, low disk space, or transient network errors and verify that the result is acceptable (e.g. fail with notification, retry, etc.). These tests are important because long jobs are more likely to hit rare errors that a short run will never see.
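
A failure-mode test can be sketched without any real infrastructure by injecting a fake transient error and verifying the retry behavior; the with_retries wrapper and FlakyStep class below are hypothetical stand-ins for real training-infrastructure code.

```python
import time

def with_retries(fn, max_retries=3, base_delay=0.01):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries:
                raise                                  # out of retries: fail loudly
            time.sleep(base_delay * 2 ** attempt)      # exponential backoff

class FlakyStep:
    """Simulates a transient network error on the first two calls."""
    def __init__(self):
        self.calls = 0
    def __call__(self):
        self.calls += 1
        if self.calls <= 2:
            raise ConnectionError("simulated transient failure")
        return "ok"

step = FlakyStep()
assert with_retries(step) == "ok"
assert step.calls == 3        # failed twice, succeeded on the third try
```

The same pattern works for injected OOM or disk-space errors: simulate the fault, then assert that the observed outcome (retry, clean failure, notification) is the one we intended.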

Finally, we will run a canary training test. A canary training test is different from a smoke run because it uses real data and real configuration, just with strict limits. A smoke run proves the pipeline is connected. A canary run proves the pipeline still behaves well on actual data before we spend days of compute, for example by checking that loss decreases over a few epochs.
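
The core canary check, that loss decreases under strict limits, can be sketched with a stand-in one-parameter model; in a real canary run, the loop below would be the actual training loop on real data with a small step budget.

```python
def canary_loss_curve(steps=20, lr=0.1):
    w = 5.0                       # parameter far from the optimum w* = 0
    losses = []
    for _ in range(steps):
        loss = w * w              # loss = w^2, gradient = 2w
        losses.append(loss)
        w -= lr * 2 * w           # one gradient descent step
    return losses

losses = canary_loss_curve()
assert losses[-1] < losses[0]                 # loss decreased overall
assert losses[-1] < 0.1 * losses[0]           # and by a meaningful margin
```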

6.4 Monitoring

The OPT-175B team treated monitoring as active operations, not passive dashboard viewing. They spent large parts of the day watching TensorBoard curves and system signals, then making real-time decisions based on what they saw: pausing unstable runs, restarting from checkpoint, lowering learning rate when loss behavior looked risky, and swapping out unhealthy nodes so training could continue. “Babysitting” a model training run isn’t only for multi-week training jobs, though; it also applies at moderate scale.

When a training run lasts for hours or days, monitoring is how we avoid finding bad news too late. We are continuously asking three simple questions: is the run healthy, is it actually improving the model, and are we using expensive resources well?

It helps to watch on two time scales. In the fast loop (seconds to minutes), we watch operational signals like node status, GPU utilization, GPU memory pressure, data loading speed, step time, communication overhead, and restart events. In the slower loop (evaluations to epochs), we watch learning signals like training and validation loss, task metrics (for example perplexity or accuracy), and instability patterns such as repeated loss spikes or exploding gradients.

We should also look at both per-worker and whole-run views. Per-worker dashboards catch local issues, like one slow data shard or one unhealthy node. Run-level dashboards tell us whether the overall job is still on track. Together, those two views help us tell the difference between a local glitch and a systemic problem.

Monitoring only helps if we connect it to action. So we pair dashboards with alerts and runbooks: for example, “validation loss has not improved for N evaluations,” “step time regressed by more than 20%,” or “GPU utilization below 50% for 10 minutes.” Each alert should map to a concrete response, such as checking data pipeline stalls, reducing communication bottlenecks, or replacing a bad worker.
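
As a sketch, two of these alert rules can be written as plain predicates over logged histories; the window sizes and thresholds are the illustrative values from the examples, not universal defaults.

```python
def no_improvement(val_losses, n=5, min_delta=0.0):
    """Alert if the last n evaluations did not improve on the best before them."""
    if len(val_losses) <= n:
        return False
    best_before = min(val_losses[:-n])
    return min(val_losses[-n:]) >= best_before - min_delta

def step_time_regressed(step_times, baseline, threshold=0.20):
    """Alert if the recent average step time exceeds baseline by more than 20%."""
    recent = sum(step_times) / len(step_times)
    return recent > baseline * (1 + threshold)

# Stalled run: the last five evaluations never beat the earlier best of 2.4.
assert no_improvement([3.0, 2.5, 2.4, 2.4, 2.41, 2.4, 2.45, 2.4], n=5)
# Healthy run: validation loss is still falling.
assert not no_improvement([3.0, 2.5, 2.0, 1.5, 1.0, 0.8], n=3)
# Step time is ~28% over baseline: fire the alert.
assert step_time_regressed([1.3, 1.25, 1.28], baseline=1.0)
assert not step_time_regressed([1.05, 1.1, 1.0], baseline=1.0)
```

In practice these predicates would run inside the monitoring system, and each firing alert would link to the runbook entry describing the concrete response.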

Finally, monitoring is also a cost-control tool. A run can look alive while still wasting money. We therefore track cost-relevant signals directly: idle accelerators, queue delays, repeated retries, and time spent below expected throughput. This lets us decide early whether to continue, retune, or stop.

We will revisit monitoring in a later chapter.

6.5 Fault tolerance and recovery

The OPT-175B logbook makes it clear that failures were frequent and operationally expensive, not rare edge cases. The notes repeatedly document incidents such as “Node down - requeue for 12.52a”, “We need to swap out 68.”, and “Node failure at ~1950 steps”, alongside actions like “Replaced node-[1,39] in cloud UI”. This is exactly why fault tolerance and recovery logic has to be a first-class part of training infrastructure: long runs will encounter faults, and so we need a reliable procedure to detect faults, swap hardware, resume from checkpoint, and continue safely.

The OPT paper quantifies this pattern: “In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months… Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.” In plain English, failures were routine, and the team survived by repeatedly replacing faulty machines and relying on restart automation.

The larger the scale, the more necessary it becomes to combine several fault tolerance measures so recovery is predictable and fast. Some of these measures include:

  • Regular checkpointing to durable storage: save model, optimizer, scheduler, RNG, and dataloader state often enough that a restart does not lose hours of progress.
  • Automatic restart and requeue policy: when a node or worker fails, automatically restart from the latest checkpoint with retry limits and backoff.
  • Node health checks and quarantine: detect unhealthy hosts early, drain/remove them from the job, and replace them with healthy nodes.
  • Worker-group recovery for distributed jobs: if one worker in a synchronized training group fails, recover the full group coherently so ranks stay consistent.
  • Resume-safe input pipeline: track data position and epoch progress so resumed runs do not silently skip or double-process data.
  • Fault-injection drills: deliberately kill workers or interrupt nodes in test runs to verify that recovery really works before long production runs.
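
As one small example, a resume-safe input pipeline can be sketched with a hypothetical ResumableLoader that tracks (epoch, index) as part of checkpoint state, so a resumed run continues at the same data position instead of silently skipping or double-processing data:

```python
class ResumableLoader:
    def __init__(self, data, epoch=0, index=0):
        self.data, self.epoch, self.index = list(data), epoch, index

    def state(self):
        # This dict would be saved alongside model/optimizer state.
        return {"epoch": self.epoch, "index": self.index}

    def __iter__(self):
        while True:                               # loop over epochs forever
            while self.index < len(self.data):
                item = self.data[self.index]
                self.index += 1
                yield item
            self.epoch += 1
            self.index = 0

loader = ResumableLoader(["a", "b", "c", "d"])
it = iter(loader)
seen = [next(it) for _ in range(3)]            # process a, b, c ...
ckpt = loader.state()                          # ... then "checkpoint"
assert seen == ["a", "b", "c"]
assert ckpt == {"epoch": 0, "index": 3}

# A brand-new process restores the position and continues with "d".
resumed = ResumableLoader(["a", "b", "c", "d"], **ckpt)
assert next(iter(resumed)) == "d"
```

Real dataloaders also need to restore shuffling RNG state so the resumed epoch visits the same permutation; this sketch omits shuffling for clarity.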

6.6 Experiment tracking

In the OPT-175B example, they did not run their training code once. They ran many experiments! For example, they started with what they called “kitchen-sink” runs on a smaller model to test their training code and various details of their model implementation, as well as different configurations of datasets. This helped them identify some serious problems with their dataset before kicking off the month-long full scale run. They also had a few larger-scale runs with different training hyperparameters. All of these experiments were closely tracked and their details logged, in order to document everything about them and use them to draw meaningful conclusions.

In at-scale ML training, logging is very cheap relative to training. When you train a small model on a laptop, if you forget what the final loss was in a particular configuration, or if you didn’t bother checking memory usage during training, you can just run it again and find out. For larger training jobs, though, you don’t want to waste any compute - so you need to keep careful track of every single training run. Typically, you should use an experiment tracking service, like MLflow34 (which is open source and can be self-hosted) or Weights & Biases.

First, let’s clarify what we mean by a training run, by distinguishing an experiment from a run. An experiment represents a single modeling goal (for example, improving validation loss). A run is one attempt to pursue that goal with a specific configuration. Usually we will have many runs in each experiment.

flowchart TD
    EXP["Experiment: optimizer-ablations-v1<br/>Goal: improve validation loss"]

    R0["Run 0 (baseline)<br/>lr=3e-4, batch=1024, weight_decay=0.01"]
    R1["Run 1<br/><b>changed: lr=1e-4</b><br/>batch=1024, weight_decay=0.01"]
    R2["Run 2<br/><b>changed: batch=2048</b><br/>lr=3e-4, weight_decay=0.01"]

    EXP --> R0
    EXP --> R1
    EXP --> R2

    classDef box fill:#e5e7eb,stroke:#222,stroke-width:1px,color:#111;
    class EXP,R0,R1,R2 box;

Furthermore, we will distinguish parameters, metrics, and artifacts. Parameters in this context means hyperparameters: the configuration values we chose for the run. Metrics tell us what happened during the training run. Artifacts are the concrete outputs that we can use to reproduce a training run.

Parameters we should log include:

  • Learning rate, scheduler type, warmup steps, and weight decay.
  • Batch size (global and per device), gradient accumulation steps.
  • Optimizer choice and settings (for example, Adam betas and epsilon).
  • Model hyperparameters (for example, layer count, base model or starting checkpoint).
  • Training code version and data snapshot identifier.

We will use these to compare runs fairly, to reproduce the same run, and to identify which configuration changes had a meaningful positive or negative effect.

Metrics we should log include:

  • Training and validation loss or other “optimizing metrics” over steps.
  • Throughput (e.g. samples per second) and step time.
  • GPU and CPU memory usage and utilization.
  • Gradient norm and learning rate over time.

We will use these to decide whether to stop or continue a run, and to diagnose training problems.

Artifacts we should log include:

  • Checkpoints at regular intervals and the final model weights.
  • Environment snapshot (for example, GPU details, versions of installed Python packages).
  • Full training configuration file.
  • Evaluation outputs (for example, confusion matrices, or sample generations).
  • Run logs and any plots generated during training.

Artifacts will be used later for more detailed diagnostics, to resume training after a failure, to hand off a model to downstream teams, and to reproduce results.
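
To make the three groups concrete, here is a minimal sketch of a run record persisted as JSON; real tracking services like MLflow store the same information behind an API, a database, and object storage, and all the specific values below are illustrative.

```python
import json
import os
import tempfile

run_record = {
    "experiment": "optimizer-ablations-v1",
    "run_id": "run-001",
    # Parameters: configuration we chose, including code/data versions.
    "params": {"lr": 3e-4, "batch_size": 1024, "weight_decay": 0.01,
               "code_version": "abc1234", "data_snapshot": "corpus-v1"},
    # Metrics: what happened, as (step, value) series.
    "metrics": {"train_loss": [(100, 3.2), (200, 2.9)],
                "samples_per_sec": [(100, 512.0)]},
    # Artifacts: pointers into object storage, not the bytes themselves.
    "artifacts": {"checkpoint": "ckpts/run-001/step200.pt",
                  "config": "configs/run-001.yaml"},
    # Human-written intent, like the OPT logbook entries.
    "notes": "baseline before lr sweep",
}

path = os.path.join(tempfile.mkdtemp(), "run-001.json")
with open(path, "w") as f:
    json.dump(run_record, f)

with open(path) as f:                          # persisted and reloadable
    restored = json.load(f)
assert restored["params"]["lr"] == 3e-4
```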

All of these should be logged to persistent storage. Usually, some kind of relational database will be used to track experiments and runs, and to store metrics and parameters. This relational database should keep its data on a persistent volume. Artifacts are often logged to object storage (model checkpoints in particular can be very large, so artifact storage must be carefully managed).

Automated logging is essential, but it’s not always enough. You may have noticed that in the OPT-175B training effort, the team logged experiment curves in TensorBoard, but they also kept a human-written logbook of decisions and incidents. That narrative record matters because curves alone do not explain intent, for example why a hyperparameter changed, why a restart happened, or why a run was stopped early. Even a short note on important runs is very helpful after the fact.

There are some “gotchas” to be aware of. A common tracking assumption is that one tracked process maps cleanly to one run. That assumption is violated in two ways. First, restarts: a single logical run may span multiple process lifetimes, so we need to decide how to track these. Second, distributed training: one logical run may involve many worker processes. We may collect hardware signals across all GPUs (utilization, memory, throughput), but we should use one designated writer for run-level records (parameters, training metrics, and artifacts) to avoid duplicates and conflicting writes. For FSDP specifically, artifact logging also needs explicit checkpoint semantics: if we want a single deployable model file, we gather shards into a full state dict before logging; if we keep sharded checkpoints, we should log them as sharded artifacts and document that restoring them requires FSDP-aware loading.
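
The designated-writer pattern can be sketched as follows, assuming the launcher exposes each worker’s rank in a RANK environment variable (as torchrun does); the log_run_metric helper is hypothetical.

```python
import os

def is_designated_writer():
    # Rank 0 is the conventional choice for run-level writes.
    return int(os.environ.get("RANK", "0")) == 0

def log_run_metric(store, name, value):
    if is_designated_writer():          # other ranks skip run-level writes
        store.setdefault(name, []).append(value)

store = {}
os.environ["RANK"] = "0"
log_run_metric(store, "loss", 2.5)      # rank 0 writes
os.environ["RANK"] = "3"
log_run_metric(store, "loss", 2.5)      # rank 3 is a no-op
assert store == {"loss": [2.5]}         # exactly one record, no duplicates
```

Per-GPU hardware signals (utilization, memory) can still be collected from every rank; the single-writer rule applies only to run-level parameters, metrics, and artifacts.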

6.7 Hyperparameter tuning

The OPT-175B team ran some ablations, but ultimately tried to follow published settings from other teams as closely as possible. The reason is practical: once a single training run is expensive, we cannot treat tuning like an exhaustive search problem. At that point, hyperparameter tuning becomes a resource allocation problem as much as a modeling problem.

A useful way to structure tuning at scale is to separate two decisions. First, a search strategy proposes configurations to try (for example random search, Bayesian optimization, or bandit-style search). Second, a scheduling policy decides how much budget each trial should get, based on intermediate metrics. In other words, one component chooses what to try, and another decides when to continue, pause, or stop trials early.

flowchart LR
    SS("**Search Spaces**<br/><i>Which parameters?</i>")
    TR("**Objectives**<br/><i>Which objective to tune?</i>")
    SA("**Search Algorithms**<br/><i>How to choose the next configuration?</i>")
    SC("**Schedulers**<br/><i>When to stop?</i>")
    TI("**Trials**<br/><i>Run experiments</i>")
    AN("**Analyses**<br/><i>Analyse results</i>")

    SS --> TR
    TR --> TI
    SA --> TI
    SC --> TI
    TI --> AN

    classDef box fill:#e5e7eb,stroke:#222,stroke-width:1px,color:#111;
    class SS,TR,SA,SC,TI,AN box;

In this context, a search algorithm answers one question: what configuration should we try next?

  • Grid search: We pick a short list of values for each hyperparameter and try every combination. If we test learning rate {1e-3, 1e-4, 1e-5} and depth {12, 24, 36}, that is 9 runs. This is easy to explain and easy to audit, but it burns budget quickly because combinations explode.
  • Random search: We sample each run from ranges instead of enumerating every combination. Suppose we run 60 trials with lr ~ loguniform(1e-5, 1e-3), weight_decay ~ loguniform(1e-6, 1e-2), and batch_size in {64, 128, 256}. After those 60 runs, we may see most strong configurations cluster near lr around 2e-4 and weight_decay around 8e-4, which gives us a useful region to refine. This is often better than grid search for the same number of runs, because grid search spends equal effort on every dimension even if one hyperparameter turns out to be mostly irrelevant; with random search, each trial still gets a unique sampled value for every hyperparameter, so we usually cover the important dimensions more efficiently.
  • Bayesian optimization: We start with a small seed set, then choose each next trial using observed results from earlier trials. If the first 12 runs suggest a promising region around lr=1.5e-4 to 3e-4 and weight_decay=5e-4 to 2e-3, the next proposals spend more trials near that region (for example lr=2e-4, wd=8e-4 and lr=1.6e-4, wd=9e-4) while still testing uncertain regions (for example lr=7e-5, wd=4e-3). That balance is the key tradeoff: exploitation helps us improve quickly where current results look strong, while exploration keeps us from getting overconfident too early and missing better settings elsewhere.
  • Evolutionary search: We keep a population of configurations, keep stronger ones, and generate new candidates by mutation or crossover. We might begin with 24 configs over (lr, wd, dropout), keep the top 8 after short runs, then create new candidates with mutations such as lr *= 1.2, wd *= 0.7, and dropout += 0.05 (within allowed bounds), plus a few random immigrants each generation to preserve exploration.

Comparison of grid, random, Bayesian, and evolutionary hyperparameter search patterns.
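
The random search example above can be sketched in a few lines; the ranges match the ones in the text, and the sampler itself is a generic illustration rather than any particular library’s API.

```python
import math
import random

def loguniform(rng, low, high):
    # Sample uniformly in log space, so 1e-5..1e-4 gets as much
    # coverage as 1e-4..1e-3.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_trial(rng):
    return {
        "lr": loguniform(rng, 1e-5, 1e-3),
        "weight_decay": loguniform(rng, 1e-6, 1e-2),
        "batch_size": rng.choice([64, 128, 256]),
    }

rng = random.Random(0)                   # fixed seed for reproducibility
trials = [sample_trial(rng) for _ in range(60)]

assert len(trials) == 60
assert all(1e-5 <= t["lr"] <= 1e-3 for t in trials)
assert all(t["batch_size"] in {64, 128, 256} for t in trials)
```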

A scheduler answers a different question: now that a trial is running, should we keep spending budget on it?

  • FIFO (no early stopping): We run each trial to completion in submission order. If we launch 40 trials and each one trains for 50 epochs, every trial gets the full 50 epochs whether it looks promising or not. This is simple and predictable, but expensive when many configurations are clearly weak early on.
  • Median stopping rule: We compare a trial to what previous trials achieved at the same training point, and stop it if it is clearly behind. For example, at epoch 10, if a trial’s validation loss is worse than the historical median epoch-10 loss, we stop it and free those resources. This is easy to apply, but it can make bad decisions when metrics are noisy or when good models learn slowly at first.
  • Successive halving: We start many trials with a small budget, then repeatedly keep only the stronger fraction and give them more budget. For instance, with eta=3, we can run 81 trials for 1 epoch, keep 27 and run to 3 epochs, keep 9 and run to 9 epochs, then keep 3 and run to 27 epochs. This usually saves a lot of compute, but it depends on early metrics being trustworthy.
  • Population-based scheduling: We keep many trials running, and at fixed intervals we replace weaker trials with copies of stronger ones, then mutate a few hyperparameters before continuing. A common pattern is to check every 5 epochs, copy weights from top trials into bottom trials, and perturb learning rate by factors like x0.8 or x1.2. This can discover good hyperparameter schedules during training, but it is operationally heavier and more checkpoint-intensive.

Search algorithms and schedulers should be tuned together. A strong search algorithm with no early-stopping scheduler wastes compute, while an aggressive scheduler with a poor search strategy prunes many trials but may miss good regions. Many good search algorithms and schedulers are available in Ray Tune, Optuna, and HyperOpt.

6.8 Resource sizing and cost control

In the OPT-175B logbook, the team wrote:

Letting nodes idle costs $2500/hour so it is strongly discouraged.

That one number captures the core platform concern for training infrastructure: compute is not just a technical resource, it is a budget expressed in dollars. At large scale, every hour of idle accelerators can erase the value of many modeling decisions.

We therefore treat model size and resource size as a key tradeoff. A larger model may improve quality, but it also raises training time, memory pressure, failure exposure, and serving cost. Before we scale up, we should ask a concrete question: if this run costs 2-3x more, do we expect a quality gain that is actually worth 2-3x more in both training and operations? In many projects, a smaller model we can train three times this week is more useful than a larger model we can afford to run once.

During development, we should scale in stages:

  • Start with cheap settings for fast iteration: a smaller model variant, a small but realistic data slice, and one low-cost GPU.
  • Move to medium settings only after the pipeline is stable: loss decreases as expected, metrics are being logged correctly, and checkpoint/resume works.
  • Use expensive multi-GPU or multi-node jobs only for final validation, required ablations, and production retraining.

This workflow only works if we manage compute lifecycle aggressively. We bring resources up when we have queued work, and we tear them down when runs finish. We avoid paying for idle instances just to save setup time. Instead, we persist checkpoints, configs, logs, and artifacts to durable storage, so resume-from-checkpoint is a normal workflow rather than an emergency workflow.

To size resources pragmatically, we profile early and reuse measurements. A simple baseline is one representative epoch where we record:

  • startup overhead (provisioning, environment setup, data mount);
  • step-time distribution (not just average) and full epoch duration;
  • throughput (samples or tokens per second) and GPU utilization;
  • memory headroom and where the input pipeline stalls.

With those numbers, we can estimate wall-clock time for larger runs, choose instance bring-up windows deliberately, and avoid over-provisioning “just in case.” We can also see when adding more GPUs stops paying off because scaling efficiency drops (for example, 2 GPUs gives 1.8x speedup, but 4 GPUs gives only 2.4x). That is the practical goal: right-size resources by phase, and scale only when the expected quality gain justifies the marginal cost.
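
The sizing arithmetic here can be written down directly; all the numbers below are illustrative.

```python
def wall_clock_hours(total_samples, samples_per_sec, startup_hours):
    # Estimate total run time from profiled throughput plus fixed startup cost.
    return startup_hours + total_samples / samples_per_sec / 3600

def scaling_efficiency(speedup, n_gpus):
    return speedup / n_gpus          # 1.0 means perfect linear scaling

# Profiled baseline: 1 GPU sustains 250 samples/s; the run needs 90M samples.
one_gpu = wall_clock_hours(90_000_000, 250, startup_hours=0.5)
assert round(one_gpu, 1) == 100.5    # about four days on one GPU

# Measured speedups from short profiling runs (as in the example above).
assert scaling_efficiency(1.8, 2) == 0.9    # 2 GPUs: still efficient
assert scaling_efficiency(2.4, 4) == 0.6    # 4 GPUs: paying 4x for 2.4x
```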

6.9 Scheduling and resource management

In the OPT-175B effort, the team used Slurm (a cluster scheduler) to place each training job on a set of compute nodes. In the logbook, we can see this process directly: the NodeList changes over time, and nodes move between alloc (currently assigned to jobs), idle (available for scheduling), and drain (temporarily taken out of service because they are suspected unhealthy). The logbook entries show both job submission and live job repair, where operators update node assignments and resume runs with Slurm commands such as scontrol.

At scale, training is not “run code on 1,024 GPUs.” It requires continuous resource management: keeping a healthy node set, removing bad nodes, swapping in spares, and requeueing jobs so distributed training can continue when part of the cluster fails. Furthermore, although we didn’t see much of this in the OPT-175B example, when a cluster supports many teams at once, we also need clear rules for how jobs are ordered and how they are placed on nodes.

This is why we need infrastructure and platforms for training, not just scripts that launch processes. We need a platform that keeps track of incoming jobs, knows what each job needs (like GPUs, memory, and node count), watches what is happening across the cluster, and then makes two connected decisions: what to run next (scheduling) and where to run it (placement).

If scheduling and placement decisions are made poorly, we can create gridlock. For example, imagine two distributed jobs in the queue, and each job needs two nodes. If we start one half of job A on one node and one half of job B on another node, neither job has enough resources to make progress. Both jobs wait, both nodes are occupied, and the queue stalls until something is manually interrupted.

Fortunately, high performance computing has been dealing with scheduling and placement for decades, so we do not have to start from scratch. Some standard scheduling policies:

  • FIFO: we run jobs in the order they arrive. This feels fair and is easy to reason about, but if a very large training run gets in line first, many short jobs behind it can wait a long time even when they could finish quickly. So, it can increase overall waiting time.
  • Backfill: backfill keeps the promise to earlier jobs, but it also uses idle gaps for smaller jobs that can finish before the reserved start time. This helps us avoid wasting GPUs while still honoring queue order. Notice, though, that this requires advance knowledge of the resource requirements and expected run time of a job.
  • Fairshare: fairshare checks who has used the cluster recently, not only who clicked submit first. If one team has already consumed a lot of GPU time, its next jobs will get a lower priority in the queue than jobs submitted by other teams. This automatically prioritizes e.g. a single training job over an extensive parameter sweep involving dozens of training jobs.
  • Priority: priority scheduling runs jobs by policy importance, such as incident response, production deadlines, or business-critical workloads. This means an urgent retraining job may run before older exploratory jobs.
  • Preemption: with preemption, a high-priority job can interrupt lower-priority work and take resources immediately. This only works well when interrupted jobs can safely restart from checkpoints.
  • Time-slice: time-slicing rotates execution windows across jobs so many users can get quick responsiveness. This can be helpful for short interactive work, but it is a poor fit for distributed training: workers in a distributed job need to sync gradients at each step, and if different workers get compute at different times, some workers sit idle waiting at all-reduce barriers. This waiting can dominate step time and make training much slower than expected.
  • Gang scheduling: gang scheduling makes all required workers for a distributed job available at the same time. We do this because partial allocation is usually useless for synchronized training: if some workers are missing, the job cannot make real progress.

We also need to think about how we place jobs on nodes. In placement, we are balancing two goals that pull in opposite directions. We want load balancing, so the cluster stays busy instead of overloading a few nodes while others sit idle. We also want locality, so parts of the same job that communicate frequently are placed close together, which reduces network traffic and synchronization delay. In practice, pushing harder on one goal often hurts the other, so placement policy is about choosing the right compromise for the workload.

Again, we have strategies from HPC:

  • Spread: place workers across more nodes to reduce contention and improve fault isolation. For example, if we run four independent inference replicas, spreading them across four nodes means one node failure removes only one replica, not all four.
  • Pack: place workers on fewer nodes to improve locality and leave more nodes free for other jobs. For example, if we run a 4-GPU training job, placing all workers on one 4-GPU node usually reduces gradient communication delay because traffic stays on fast intra-node links instead of crossing the network.
  • Balanced: distribute workloads to keep utilization even across nodes and avoid hot spots. This is useful on shared clusters with many medium jobs, where we want to avoid overloading a few nodes while others sit mostly idle.
  • Group: place related tasks together as a bundle so they can communicate efficiently and share node-local resources. For example, we may keep a trainer and an evaluator sidecar on the same node so both can reuse local checkpoints and caches.
  • Object-locality: schedule compute near the data or objects it will read, to reduce cross-node transfers. For example, if large embedding batches are already cached at one worker, it is often faster to run the next stage there than to copy those embeddings across nodes. (Usually we place compute and then bring the data; this strategy represents the reverse: “bring the compute to the data”.)

For standard HPC environments, Slurm (Simple Linux Utility for Resource Management) is the de facto workload manager and job scheduler, and it is still widely used for ML. But Slurm alone is often not the best fit for modern ML development loops. ML teams usually run many short experiments, frequent restarts, and mixed workloads (long distributed training, short evaluations, and hyperparameter trials). Some papers about ML workloads shed light on their properties:

  • In a large production ML platform at Alibaba (6,742 GPUs), Weng et al. observed5 that many tasks used only a small fraction of a GPU at a time (median SM usage around 0.042 of a full GPU), while a smaller set of tasks needed entire high-end GPUs with strict placement requirements such as locality and gang scheduling. That mix makes scheduling hard: one policy must handle both tiny fractional workloads and “all-at-once” distributed jobs. The same study also showed that queueing behavior can be counterintuitive. Around 9% of short-lived tasks spent more than half of their end-to-end time waiting in queue, and many workloads were recurring (about 65% of tasks repeated at least five times), which made runtime prediction useful for scheduling.
  • For LLM development, Hu et al. found6 an even stronger skew in a 4,704-GPU A100 datacenter trace: only 3.2% of jobs (pretraining) consumed about 94.0% of total GPU time, while evaluation jobs made up most of the job count (92.9%) but consumed only about 0.8% of GPU time. They also reported very high median GPU usage for LLM workloads (about 99% compute utilization and about 75% memory utilization), which implies that techniques based on sharing one GPU among many light tasks are often less effective for LLM training than for earlier ML workloads. Furthermore, they found that LLM evaluation pipelines can burn wall-clock time outside actual GPU computation. In one profiled evaluation workload, they measured substantial time in model loading/data preprocessing (about 29.5%) plus correctness testing (about 19.0%), leaving only roughly half the time in GPU inference.

As an alternative to general HPC platforms like Slurm, for ML workloads we may prefer an ML workload orchestrator, so that scheduling and placement are tied directly to ML workflow needs such as worker-group coordination for distributed training, retries, and checkpoint-based restarts. A platform optimized for ML-specific work can eliminate common pain points, like a 5-minute evaluation waiting behind multi-hour jobs, or a multi-day training job failing because of a hung node.

We will consider Ray as an example. Ray is a distributed computing and workload orchestration framework designed specifically for machine learning workloads.7 A Ray cluster is a set of worker nodes connected to a common Ray head node. The head node keeps track of cluster membership and resources and helps coordinate scheduling decisions (for example, which worker should run which task). The workers execute the actual computation by carrying out units of work:

  • job: a job is one top-level application run on the cluster.
  • actor: an actor is a process that stays alive and keeps state in memory across calls.
  • task: a task is a one-time, stateless function execution.

Worker nodes also keep a local object store, for intermediate task outputs; these can stay local and be consumed locally by the next task, or be transferred to other workers that need them.
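As a plain-Python analogy (not Ray's actual API), the sketch below contrasts these units of work: a stateless task, a stateful actor, and a stand-in for a worker's local object store. All names here are hypothetical; in real Ray, tasks are `@ray.remote` functions, actors are `@ray.remote` classes, and the object store is managed by Ray itself.

```python
def tokenize(text: str) -> list:
    """Task: a one-shot, stateless function; any worker could run it."""
    return text.lower().split()

class TokenCounter:
    """Actor: a long-lived process that keeps state in memory across calls."""
    def __init__(self):
        self.total = 0

    def add(self, n: int) -> int:
        self.total += n  # state survives between calls
        return self.total

object_store = {}  # stand-in for a worker's local object store

tokens = tokenize("The Quick Fox")        # run the task once
object_store["tokens"] = tokens           # keep the output local...
counter = TokenCounter()                  # start the actor
counter.add(len(object_store["tokens"]))  # ...and consume it in the next step
print(counter.total)                      # 3
```

The point of the analogy: the task's output can stay in the local object store and be consumed by the next unit of work without crossing the network, while the actor accumulates state across many calls.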

[Figure: A Ray cluster. The head node (control plane) schedules tasks and actors onto worker nodes (data plane). Each worker runs tasks and actors and holds a local object store; workers transfer objects to one another and read/write a shared data repository.]

Ray is a powerful platform for distributed computing, but in this chapter, we will consider it mainly as an ML workload orchestrator. We can use its orchestration capabilities with different levels of integration:

  • Run a specific job on the cluster, with a defined environment. Sometimes we just want to run a training job without any modifications (e.g. take an existing PyTorch training script that we developed and tested in an interactive Jupyter environment, and run it on a GPU cluster). We can do this with Ray; we describe the runtime environment (dependencies and settings), point to the code we want to execute, and submit it with the right context (working directory and resource needs) using ray submit. The head node then queues and places that job when the requested resources become available. In managed cloud deployments (for example Ray on Kubernetes), the platform can also autoscale the cluster to provision those resources instead of waiting indefinitely.
  • Orchestrate a workload through ML framework integrations. Ray integrates with many ML frameworks (for example PyTorch Lightning, scikit-learn, XGBoost, and LightGBM) and adds an orchestration layer on top: after we submit a top-level job, Ray can run the actual distributed training as coordinated long-lived worker processes rather than repeatedly launching one-off tasks, while handling worker startup, distributed coordination, checkpointing, metric reporting, and failure recovery. For example, if we run an unmodified training script and one training worker node dies, the whole run often fails unless we build our own recovery logic; with Ray Train, workers report checkpoints to shared storage and the run can be resumed by restarting the failed workers (or worker group) from the latest checkpoint.
  • Manage many trials and allocate resource budget. When we integrate Ray Tune into a hyperparameter tuning workload, we can launch many candidate configurations in parallel and make policy decisions as intermediate metrics arrive. Instead of giving every trial the same budget, Tune can continue promising trials, stop weak trials early, and launch new configurations in the freed capacity. Ray then places the selected trials on available cluster resources according to their CPU/GPU requirements. For example, if we want to test 16 hyperparameter configurations but the cluster can only support 4 at a time, Tune can start four trials, stop poor performers after early epochs, and immediately reuse those resources for new trials, so the cluster spends more time on promising configurations and less time on dead ends.
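The trial-scheduling idea in the last bullet can be sketched in plain Python (this is a simplified stand-in for Ray Tune's trial schedulers, not its API; the "loss" model and all parameter values are invented for illustration):

```python
import random

def run_epoch(config: float, epoch: int) -> float:
    """Hypothetical trial: treat `config` itself as the loss we would
    observe, plus noise that shrinks as training progresses."""
    return config + random.gauss(0, 0.01) / (epoch + 1)

def tune(configs, capacity=4, epochs=3, keep=0.5):
    """Sketch of early stopping: run trials `capacity` at a time and,
    after each epoch, stop the weaker half so the freed resources go to
    the next batch of candidates sooner."""
    best = None
    for start in range(0, len(configs), capacity):
        trials = list(configs[start:start + capacity])
        for epoch in range(epochs):
            scored = sorted((run_epoch(c, epoch), c) for c in trials)
            trials = [c for _, c in scored[:max(1, int(len(scored) * keep))]]
        survivor = trials[0]
        if best is None or survivor < best:
            best = survivor
    return best

random.seed(0)
candidates = [i / 16 for i in range(16)]  # 16 candidate settings, lower is better
print(tune(candidates))                   # only 4 trials ever run at once
```

With 16 configurations and capacity for 4, every configuration still gets evaluated, but weak trials consume at most one or two epochs before their resources are reclaimed.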

A Ray cluster also incorporates the scheduling and placement decisions we discussed earlier, in addition to its ML-specific capabilities. In Ray, each job, actor, or task declares resource needs (CPUs, GPUs, and optional custom resources), and the scheduler places work based on availability and policy constraints. It first checks whether a node is feasible for the requested CPUs/GPUs/custom resources, then prefers nodes that already have related work or local data, and also tries to avoid overloading “hot” nodes by choosing from a top set of lower-utilization candidates. We can also override that behavior when needed: spread work across nodes for resilience, keep a worker group together for tightly synchronized training, pin work to a specific node when we need cached data or a special device, or target labeled node classes (for example accelerator type or zone). When deployed within Kubernetes, Ray can do priority scheduling and gang scheduling. It can also integrate with NVIDIA KAI for more ML-specific scheduling strategies.
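The feasibility-then-utilization logic described above can be illustrated with a toy scheduler (a conceptual sketch, not Ray's implementation; the cluster dictionary layout and node names are invented):

```python
def pick_node(nodes, request):
    """Toy placement decision: first filter nodes that can satisfy the
    resource request, then choose the least-utilized feasible node to
    avoid piling work onto hot nodes.

    `nodes` maps name -> {"free": {"CPU": n, "GPU": n}, "util": 0..1}.
    """
    feasible = [
        name for name, info in nodes.items()
        if all(info["free"].get(res, 0) >= amt for res, amt in request.items())
    ]
    if not feasible:
        return None  # infeasible now: queue the work or autoscale
    # rank feasible nodes by current utilization, coolest first
    return sorted(feasible, key=lambda n: nodes[n]["util"])[0]

cluster = {
    "node-a": {"free": {"CPU": 8, "GPU": 0}, "util": 0.2},
    "node-b": {"free": {"CPU": 4, "GPU": 2}, "util": 0.9},
    "node-c": {"free": {"CPU": 16, "GPU": 4}, "util": 0.3},
}
print(pick_node(cluster, {"GPU": 1}))  # node-c: feasible and least loaded
```

Note how node-a is skipped despite being the least loaded (it has no free GPU), and node-b is feasible but passed over as a hot node; the policies listed above (spread, grouping, pinning) would replace or constrain the final ranking step.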

6.10 System design

When we design training infrastructure, we are not only choosing where code runs. We are choosing how runs are launched, tracked, recovered, monitored, and stopped under real cost and reliability constraints. Those choices determine whether a team can train once, or train repeatedly and safely.

For this chapter, we can frame the design problem as: how do we run many training and tuning workloads on shared GPU infrastructure, with strong reliability and clear cost control?

Design tasks:

  1. Define workload classes (single training runs, distributed training runs, hyperparameter tuning sweeps, short evaluation jobs) and expected service levels for each (queue time, completion time, failure recovery time).
  2. Define run identity and metadata contracts (run id, code version, data snapshot, configuration, owner, purpose).
  3. Choose cluster management and orchestration layers (for example Kubernetes + Ray, or Slurm + orchestration layer) and define where scheduling versus placement decisions live.
  4. Define shared-resource policies (priorities, fair share, preemption rules, max concurrency, per-team quotas).
  5. Define checkpoint and recovery strategy (frequency, storage location, restart policy, retry budget, worker-group recovery behavior).
  6. Define experiment tracking and artifact management (parameters, metrics, artifacts, retention, lineage notes, intent logging).
  7. Define monitoring and alerting (training health, infrastructure health, cost signals, and runbooks tied to alerts).
  8. Define a progressive scaling workflow (small development runs, medium validation runs, large production retraining runs).
  9. Define a tuning workflow (search policy, trial scheduler policy, stop/continue criteria, reproducibility checks for top configurations).
  10. Define governance and operations boundaries (access control, auditability, reproducibility, ownership, on-call responsibilities).
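As one example of design task 2, a run identity and metadata contract can be as simple as a record whose id is derived from the inputs that define the run. The field names and the id scheme below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
import hashlib
import json
import time

@dataclass(frozen=True)
class RunRecord:
    """Sketch of a run metadata contract (illustrative field names)."""
    code_version: str    # e.g. a git commit hash
    data_snapshot: str   # e.g. a dataset version or content hash
    config: dict         # hyperparameters and other run settings
    owner: str
    purpose: str         # intent logging: why this run exists
    created_at: float = field(default_factory=time.time)

    @property
    def run_id(self) -> str:
        # derive a stable id from exactly the inputs that define the run,
        # so identical (code, data, config) always maps to the same id
        payload = json.dumps(
            [self.code_version, self.data_snapshot, self.config],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

run = RunRecord(
    code_version="3f2c1ab",
    data_snapshot="corpus-v5",
    config={"lr": 3e-4, "batch_size": 64},
    owner="team-nlp",
    purpose="baseline before scaling out",
)
print(run.run_id)  # same code/data/config always yields the same id
```

Deriving the id from code version, data snapshot, and configuration makes accidental duplicate runs visible and gives checkpoints and artifacts an unambiguous key for lineage.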

Questions to ask:

  • What run types do we need to support, and what queue delay is acceptable for each?
  • Which failures are common in our environment (node loss, OOM, network instability, slow data pipelines), and how quickly must we recover?
  • What is the budget envelope per week or month, and what utilization target do we need?
  • How many teams share this cluster, and what fairness or priority rules are non-negotiable?
  • What level of reproducibility do we need for audits, scientific comparison, and rollback?
  • How much hyperparameter exploration is realistic under time and budget constraints?

Design decisions to make:

  • Job submission contract: Every run must declare runtime environment, resource requests, code entrypoint, and tracking metadata.
  • Scheduling and placement policy: Define default behavior for mixed workloads (short evaluations, long distributed runs, tuning trials), plus escalation rules for urgent jobs.
  • Fault tolerance policy: Define checkpoint cadence by workload type, retry limits, and exact restart semantics after worker or node failure.
  • Tracking policy: Define required run metadata and required artifacts before a run is considered valid.
  • Monitoring policy: Define fast-loop signals (GPU utilization, step time, restarts) and slow-loop signals (loss and quality trends), plus alert thresholds and runbooks.
  • Cost policy: Define when to stop low-value runs, when to resize resources, and when to defer expensive sweeps.
  • Tuning policy: Define default search strategy and default trial scheduler, including early-stop criteria and promotion rules.
  • Promotion policy: Define what must be true before moving from development-scale runs to expensive production-scale runs.
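As one concrete instance of a fast-loop monitoring signal, a step-time regression check might compare a recent window against a baseline window. The window size and threshold factor below are illustrative assumptions, not recommendations:

```python
from statistics import median

def step_time_alert(step_times, window=5, factor=1.5):
    """Hedged sketch of a fast-loop monitoring rule: flag the run when
    the median step time over the last `window` steps exceeds the
    baseline (first `window` steps) by more than `factor`."""
    if len(step_times) < 2 * window:
        return False  # not enough data to compare yet
    baseline = median(step_times[:window])
    recent = median(step_times[-window:])
    return recent > factor * baseline

healthy = [1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0]
degraded = healthy[:5] + [1.9, 2.1, 2.0, 2.2, 1.8]  # e.g. a slow node joined
print(step_time_alert(healthy))    # False
print(step_time_alert(degraded))   # True
```

A rule like this catches infrastructure problems (a slow node, a degraded network link) within minutes, long before slow-loop signals such as loss trends would reveal anything.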

These design decisions should always be aligned to the requirements and constraints of the specific project, not copied as a fixed checklist. In general, we should match policy strength to risk and cost: low-cost exploratory work can use simpler scheduling and lighter process, while expensive or high-impact workloads need tighter controls for reliability, tracking, and recovery. This usually means making explicit choices about resource sizing and budget first (for example, deciding whether one GPU, a small multi-GPU job, or only CPU runs are justified), and then choosing matching workflow policies for queueing, checkpointing, monitoring, and tuning. A classical ML project with short, cheap runs can support broader experiments and simpler orchestration, while a large-model project with expensive runs should use narrower search spaces, stronger early-stopping and restart policies, and stricter cost guardrails. The goal is to choose the lightest infrastructure and process that still protects us from the most expensive failure modes at our scale.

6.11 Key terms

  • actor: A long-running stateful worker process that handles many method calls.
  • artifact: A file output from a run, such as a checkpoint, configuration file, or evaluation result.
  • canary training: A limited run on real data and configuration used to catch regressions before a full run.
  • checkpoint: A saved training state that allows a run to resume after interruption.
  • experiment: A named collection of related runs for one modeling objective.
  • fault tolerance: The ability of a system to continue and recover when components fail.
  • gang scheduling: A scheduling constraint where all workers in a distributed job must start together.
  • head node: The coordination node in a distributed system that tracks cluster state and dispatches work.
  • hyperparameter tuning: The process of testing multiple training configurations to find settings that improve model quality.
  • job: A submitted top-level unit of work in a compute system.
  • load balancing: Distributing work across available resources to avoid hotspots and improve utilization.
  • locality: Keeping related computation and data close together to reduce transfer and synchronization overhead.
  • metrics: Numeric measurements used to track model behavior and system performance.
  • ML workload orchestrator: A platform layer that coordinates ML jobs across scheduling, placement, and recovery.
  • node: A machine in a compute cluster.
  • parameters: In experiment tracking, configured input values recorded for a run, usually hyperparameters and other run settings.
  • placement: The decision of where to run tasks or workers in a cluster.
  • run: One execution of a training job with a specific code, data snapshot, and configuration.
  • scheduler: A policy that decides when work runs and how budget is allocated across running work.
  • search algorithm: A method that proposes new hyperparameter configurations to evaluate.
  • smoke run: A minimal end-to-end run used to verify that the training pipeline is wired correctly.
  • task: A short-lived stateless unit of remote computation.
  • worker node: A node that executes tasks or actors assigned by the scheduler.

  1. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. CoRR abs/2205.01068. https://arxiv.org/abs/2205.01068.↩︎

  2. Meta AI. 2022. OPT-175B Logbook. GitHub repository. https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf.↩︎

  3. Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Fen Xie, and Corey Zumar. 2018. Accelerating the Machine Learning Lifecycle with MLflow. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. https://people.eecs.berkeley.edu/~matei/papers/2018/ieee_mlflow.pdf.↩︎

  4. Andrew Chen, Andy Chow, Aaron Davidson, Arjun DCunha, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Clemens Mewald, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Avesh Singh, Fen Xie, Matei Zaharia, Richard Zang, Juntai Zheng, and Corey Zumar. 2020. Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle. In DEEM ’20. https://people.eecs.berkeley.edu/~matei/papers/2020/deem_mlflow.pdf.↩︎

  5. Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). https://www.usenix.org/conference/nsdi22/presentation/weng.↩︎

  6. Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. 2024. Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). https://www.usenix.org/conference/nsdi24/presentation/hu.↩︎

  7. Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In OSDI ’18. https://www.usenix.org/conference/osdi18/presentation/moritz.↩︎