Model training at scale
In this week’s lecture, we introduced techniques for training large-scale machine learning models.
We described a selection of techniques that allow us to train models that would not otherwise fit in the memory of a single GPU, or to fit a larger batch in memory:
- gradient accumulation
- reduced precision/mixed precision (a sketch combining these first two follows this list)
- parameter-efficient fine-tuning (LoRA, QLoRA), illustrated by the second sketch below
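To make the first two concrete, here is a minimal sketch (not the lecture's code) of a PyTorch training loop that combines gradient accumulation with mixed precision via `torch.cuda.amp`; the model, data, and hyperparameters are placeholders, and a CUDA-capable GPU is assumed.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()          # stand-in for a much larger model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()             # rescales fp16 losses to avoid gradient underflow

dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 512), torch.randint(0, 10, (1024,)))   # placeholder data
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

accumulation_steps = 8                           # effective batch size = 16 * 8 = 128

for step, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    with torch.cuda.amp.autocast():              # forward pass in mixed precision
        loss = loss_fn(model(x), y) / accumulation_steps
    scaler.scale(loss).backward()                # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                   # unscales gradients, then steps the optimizer
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```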
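For parameter-efficient fine-tuning, the second sketch below wraps a frozen `nn.Linear` in a bare-bones LoRA adapter, so that only a low-rank update is trained; in practice one would typically use a library such as Hugging Face PEFT, and the rank and alpha values here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative LoRA sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank update W + (alpha / r) * B @ A, with B initialised to zero
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")             # only A and B are trained
```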
We also talked about strategies for distributed training across multiple GPUs:
- distributed data parallelism, which allows us to achieve a larger effective batch size (see the sketch after this list),
- fully sharded data parallelism, which allows us to train models that might otherwise not fit into memory,
- and model parallelism, including tensor and pipeline parallelism, which distribute computation across GPUs.
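As an illustration of distributed data parallelism, the following is a minimal, hypothetical script using `torch.distributed` and `DistributedDataParallel`; it assumes it is launched with `torchrun` (one process per GPU), and the model and data are placeholders. Fully sharded data parallelism follows the same overall structure, with the model wrapped in `torch.distributed.fsdp.FullyShardedDataParallel` instead, so that parameters, gradients, and optimizer state are sharded across ranks.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with e.g. `torchrun --nproc_per_node=4 train_ddp.py` (illustrative name)
    dist.init_process_group("nccl")                      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])           # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)    # stand-in for a much larger model
    model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    dataset = torch.utils.data.TensorDataset(
        torch.randn(1024, 512), torch.randint(0, 10, (1024,)))   # placeholder data
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)  # shards data across ranks
    loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```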
In a future lesson, we will see how to offer some of these strategies as part of a model training “service” (e.g. as part of an organization’s core ML capabilities).