DevOps for ML systems
In this week’s lecture, we reviewed some of the basic ideas of DevOps:
- Continuous integration/continuous delivery
- Version control for everything
- Infrastructure as code
- Proactive monitoring and logging
and we discussed how the use of DevOps practices can increase deployment frequency and decrease lead time for changes, change failure rate, and time to restore service.
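To make these four measures concrete, here is a minimal sketch of how they might be computed from a log of deployments. The `Deployment` record layout, the `delivery_metrics` function, and the sample data are all hypothetical, made up for illustration; a real team would pull this information from its CI/CD and incident tracking tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record layout)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a service failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four delivery measures over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    n = len(deployments)
    # Lead time for changes: commit -> running in production
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Time to restore service: failed deployment -> service restored
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": n / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / n,
        "change_failure_rate": len(failures) / n,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
    }


# Made-up sample data: four deployments over four days, one of which failed.
t0 = datetime(2025, 4, 1, 9, 0)
log = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(
        t0 + timedelta(days=1),
        t0 + timedelta(days=1, hours=8),
        failed=True,
        restored_at=t0 + timedelta(days=1, hours=9),
    ),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=6)),
]
metrics = delivery_metrics(log, period_days=4)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

With this sample data, the team deploys once per day on average, with a mean lead time of six hours, a change failure rate of 25%, and a mean time to restore of one hour. DevOps practices aim to push the first number up and the other three down.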
We also described some ideas of cloud native computing, which marries DevOps and cloud computing.
Although DevOps has well-developed ideas for managing code and infrastructure, we noted that it does not necessarily have them for models and data, which are an integral part of machine learning systems. Hence the need for MLOps: DevOps, plus the application of DevOps ideas to models and data.
We introduced an MLOps lifecycle for a machine learning system, and then we discussed (with reference to Uber as a case study) the core organization capabilities that would be needed to support this lifecycle across many teams and models within the organization.
Finally, we reviewed the first few stages of the lifecycle in more depth:
- what happens in that stage
- the inputs and outputs of that stage
- and the core organization capabilities it relies on.
For the rest of the semester, we’ll work with a different one of those core organization capabilities in each unit.
Lab assignment
Due 4/29
Resource usage notes for this lab assignment:
- You will do this lab assignment on KVM@TACC, which does not require reservation.
- You can do the parts in any order, but you should only do one at a time - you must delete resources from a previous part before you start another part.
- Your resources may not be “active” for more than eight daytime (8AM - 11:59PM) hours. They may be deleted by course staff otherwise.
- Delete your compute instance as soon as you are done with the experiment, to free the resources for other students.
- This lab assignment includes one long unattended setup step (where Ansible installs and configures a Kubernetes cluster) in the middle. The recommended timeline is:
- Do notebooks 0, 1, 2, 3 (takes about an hour if you read through them in advance)
- Start notebook 4 running and, once you are convinced that it is going OK, leave it running. Go do something else - go to class, do some other coursework, etc. - for a couple of hours
- Then, continue with notebooks 5, 6, 7, 8 (these will take a couple of hours)
Reading
- Most of this lesson is based on: Khalid Salama, Jarek Kazmierczak, Donna Schut. “Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning.” Google Cloud Whitepaper, May 2021. PDF
- We also discussed a case study: Jeremy Hermann. “Meet Michelangelo: Uber’s Machine Learning Platform”. Uber blog, September 2017. URL