Model training infrastructure and platform

Last week, we introduced techniques for training very large models. This week, we discussed the infrastructure and platform required to support that kind of large-scale training, as well as the training of many models by many teams. We focused specifically on experiment tracking, versioning and reproducibility, and scheduling training jobs.

Using OPT-175B as an example, we identified some of the things we would need in an experiment tracking service (see the sketch after this list):

  • Save details of every run, including model checkpoints, hyperparameters, and code version.
  • Monitor ML measures (e.g. loss).
  • Monitor infrastructure and system health, including hardware and software systems required for training.
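The sketch below shows what logging one run could look like. It assumes MLflow as the tracking service and psutil for host health (one possible tool choice, not necessarily what was used for OPT-175B); the experiment name, hyperparameters, and checkpoint path are illustrative placeholders.

```python
# Minimal per-run tracking sketch (assumes MLflow and psutil are installed
# and the script is launched from inside a git checkout).
import pathlib
import subprocess

import mlflow
import psutil

mlflow.set_experiment("large-model-training")  # illustrative name

with mlflow.start_run():
    # Code version: tag the run with the current git commit.
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    mlflow.set_tag("git_commit", commit)

    # Hyperparameters for this run (illustrative values).
    mlflow.log_params({"learning_rate": 3e-4, "global_batch_size": 2048})

    for step in range(100):
        loss = 1.0 / (step + 1)  # stand-in for the real training loss
        # ML measures, logged per training step.
        mlflow.log_metric("train_loss", loss, step=step)
        # A slice of system health; real platforms also track GPU,
        # network, and node failures via dedicated monitoring agents.
        mlflow.log_metric("host_memory_pct",
                          psutil.virtual_memory().percent, step=step)

    # Model checkpoint saved as a run artifact.
    ckpt = pathlib.Path("checkpoints/step_100.pt")
    ckpt.parent.mkdir(parents=True, exist_ok=True)
    ckpt.write_bytes(b"")  # placeholder standing in for real weights
    mlflow.log_artifact(str(ckpt))
```

In practice, infrastructure and hardware health would be collected by monitoring agents running on every node and surfaced alongside the ML metrics, rather than logged from the training loop itself.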

Then, we discussed requirements for an ML training job scheduler:

  • Allocate resources effectively, in a way that is aligned with business priorities but also avoids underutilization.
  • Provide usability, flexibility, and observability for users and administrators.

and we introduced some scheduling and placement policies, most of which originated in the world of high-performance computing.
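For intuition, the sketch below implements one such policy in a greatly simplified form: jobs are started in priority order, and smaller jobs behind a blocked one are greedily backfilled into leftover GPUs. The job names, GPU counts, and 64-GPU pool are made-up examples, and real backfill schedulers also use runtime estimates and reservations, which are omitted here.

```python
# Simplified priority scheduling with greedy backfill on one GPU pool.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int      # GPUs requested
    priority: int  # higher runs first (proxy for business priority)


def schedule(queue: list[Job], free_gpus: int) -> list[Job]:
    """Pick jobs to start now: highest priority first; jobs that do not
    fit are skipped, letting smaller jobs fill the gap (backfill)."""
    started: list[Job] = []
    for job in sorted(queue, key=lambda j: -j.priority):
        if job.gpus <= free_gpus:
            started.append(job)
            free_gpus -= job.gpus
    return started


if __name__ == "__main__":
    queue = [
        Job("opt-175b-pretrain", gpus=48, priority=10),
        Job("hparam-sweep", gpus=32, priority=5),
        Job("ablation-small", gpus=8, priority=3),
    ]
    # The 32-GPU job does not fit once the 48-GPU job starts, so the
    # 8-GPU job is backfilled instead of leaving 16 GPUs idle.
    print([j.name for j in schedule(queue, free_gpus=64)])
    # -> ['opt-175b-pretrain', 'ablation-small']
```

A policy like this keeps utilization high, but without runtime estimates it can repeatedly delay the blocked 32-GPU job, which is exactly the kind of trade-off the HPC-derived policies are designed to manage.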

Lab assignment

Reading