Model training at scale
In this week’s lecture, we introduced techniques for training large-scale machine learning models.
We described a selection of techniques that allow us to train models that would not otherwise fit in the memory of a single GPU, or to fit a larger batch in memory:
- gradient accumulation
- reduced precision/mixed precision (a sketch combining these first two follows this list)
- parameter-efficient fine-tuning (LoRA, QLoRA), illustrated by the second sketch below
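To make the first two concrete, here is a minimal sketch (not the lecture's code) of a PyTorch training loop that combines gradient accumulation with mixed precision via `torch.cuda.amp`; the model, data, and hyperparameters are placeholders, and a CUDA-capable GPU is assumed.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()          # stand-in for a much larger model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()             # rescales fp16 losses to avoid gradient underflow

dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 512), torch.randint(0, 10, (1024,)))   # placeholder data
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

accumulation_steps = 8                           # effective batch size = 16 * 8 = 128

for step, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    with torch.cuda.amp.autocast():              # forward pass in mixed precision
        loss = loss_fn(model(x), y) / accumulation_steps
    scaler.scale(loss).backward()                # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                   # unscales gradients, then steps the optimizer
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```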
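For parameter-efficient fine-tuning, the second sketch below wraps a frozen `nn.Linear` in a bare-bones LoRA adapter, so that only a low-rank update is trained; in practice one would typically use a library such as Hugging Face PEFT, and the rank and alpha values here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative LoRA sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank update W + (alpha / r) * B @ A, with B initialised to zero
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")             # only A and B are trained
```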
We also talked about strategies for distributed training across multiple GPUs:
- distributed data parallelism, which allows us to achieve a larger effective batch size (see the sketch after this list),
- fully sharded data parallelism, which allows us to train models that might otherwise not fit into memory,
- and model parallelism, including tensor and pipeline parallelism, which distribute computation across GPUs.
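As an illustration of distributed data parallelism, the following is a minimal, hypothetical script using `torch.distributed` and `DistributedDataParallel`; it assumes it is launched with `torchrun` (one process per GPU), and the model and data are placeholders. Fully sharded data parallelism follows the same overall structure, with the model wrapped in `torch.distributed.fsdp.FullyShardedDataParallel` instead, so that parameters, gradients, and optimizer state are sharded across ranks.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with e.g. `torchrun --nproc_per_node=4 train_ddp.py` (illustrative name)
    dist.init_process_group("nccl")                      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])           # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)    # stand-in for a much larger model
    model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    dataset = torch.utils.data.TensorDataset(
        torch.randn(1024, 512), torch.randint(0, 10, (1024,)))   # placeholder data
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)  # shards data across ranks
    loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```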
In a future lesson, we will see how to offer some of these strategies as part of a model training “service” (e.g. as part of an organization’s core ML capabilities).