Model training at scale

In this week’s lecture, we introduced techniques for training large-scale machine learning models.

We described a selection of techniques that let us train models that would not otherwise fit in the memory of a single GPU, or fit a larger batch in memory (a sketch combining the first two appears after this list):

  • gradient accumulation
  • reduced precision/mixed precision
  • parameter-efficient fine-tuning (LoRA, QLoRA)
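
As a refresher, here is a minimal sketch that combines gradient accumulation with mixed precision in PyTorch. It assumes a single CUDA GPU is available; the toy model, random data, batch size, and accumulation factor are placeholders chosen for illustration, not values from the lecture.

```python
import torch
from torch import nn

# Toy model and optimizer; in practice these would be your real model and data.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4                       # micro-batches accumulated per optimizer step
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 gradient underflow

optimizer.zero_grad()
for step in range(100):
    # Placeholder micro-batch; replace with batches from your DataLoader.
    x = torch.randn(8, 512, device="cuda")
    y = torch.randint(0, 10, (8,), device="cuda")

    # Forward pass runs in mixed precision (fp16 here).
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y) / accum_steps  # average over accumulated micro-batches

    scaler.scale(loss).backward()      # gradients accumulate across micro-batches

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)         # unscale gradients and take one optimizer step
        scaler.update()
        optimizer.zero_grad()
```

The effective batch size here is the micro-batch size times `accum_steps`, so memory per forward/backward pass stays small while the optimizer sees a larger batch.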

We also talked about strategies for distributed training across multiple GPUs (see the sketch after this list):

  • distributed data parallelism, which allows us to achieve a larger effective batch size,
  • fully sharded data parallelism, which allows us to train models that might otherwise not fit into memory,
  • and model parallelism, including tensor and pipeline parallelism, which distribute computation across GPUs.
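
For reference, a minimal distributed data parallel sketch is below. It assumes the script is launched with `torchrun` (e.g. `torchrun --nproc_per_node=4 train_ddp.py`) on a machine with NCCL-capable GPUs; the model, random data, and hyperparameters are again placeholders for illustration.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model; each rank holds a full replica, and DDP averages gradients.
    model = nn.Linear(512, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Placeholder per-rank batch; in practice use a DistributedSampler
        # so each rank sees a different shard of the dataset.
        x = torch.randn(32, 512, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because every rank processes its own shard of data in parallel, the effective batch size scales with the number of GPUs, which is the main benefit of distributed data parallelism noted above.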

In a future lesson, we will see how to offer some of these strategies as part of a model training “service” (e.g. as part of an organization’s core ML capabilities).

Lab assignment

Reading