Are you an educator who wants to use some of this material in your own class? Check out the instructor guide.

Machine Learning Systems Engineering and Operations

Fraida Fund

Building deployable, reliable, and scalable machine learning systems involves a lot more than just training a model. In this graduate course on machine learning systems engineering and machine learning operations (MLOps), students will learn techniques for designing, developing, evaluating, deploying, monitoring, and updating production-ready machine learning systems at scale.

This course covers the following topics:

  1. Challenges and basic principles of machine learning systems engineering and operations.
  2. Overview of cloud computing.
  3. DevOps and continuous X (integration, training, deployment, testing, monitoring) for ML systems.
  4. Large-scale data systems.
  5. Model training at scale.
  6. Model training infrastructure and platforms.
  7. Model serving.
  8. Monitoring and evaluating ML systems.
  9. Safeguarding ML systems.
  10. Using commercial clouds.
  11. Other topics (tentative): GenAI/LLMOps, retrieval-augmented generation (RAG), agents, and the Model Context Protocol (MCP).

Students will learn through a combination of lectures, case studies, guided lab assignments on the ChameleonCloud research infrastructure and on commercial clouds (GCP, DigitalOcean), and a final project.