Evaluation and monitoring

In the lifecycle of a machine learning model, between “training” and “deployment to production” there is an evaluation stage! In this week’s lesson, we talked about evaluating a model, including:

  • evaluation on general optimizing metrics (e.g. loss, accuracy) and domain-specific metrics (e.g. perplexity, BLEU, ROUGE for text generation)
  • evaluation on those same metrics, for slices and populations of interest
  • evaluation on operational metrics (e.g. inference latency on a single sample, batch throughput, time/cost to retrain model, memory required for inference)
  • behavioral testing, with template-based unit tests to make sure that the model is robust to perturbations that should NOT change its output, and behaves correctly for perturbations that SHOULD change its output (see the sketch after this list)
  • sanity-checking via explainability techniques (e.g. feature importance, SHAP, LIME, saliency maps, attention-based explanations)
  • and regression testing for previous “known errors”
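
As an illustration of the behavioral testing point above, here is a minimal sketch of template-based invariance and directional-expectation tests, assuming a hypothetical sentiment classifier exposed through a predict_sentiment function; the templates, names, and label strings are illustrative, not from the lesson.

    # Minimal sketch of template-based behavioral tests for a hypothetical
    # sentiment classifier. predict_sentiment is a placeholder for whatever
    # inference function the real system exposes.
    def predict_sentiment(text: str) -> str:
        """Placeholder: return 'positive' or 'negative' for the given text."""
        raise NotImplementedError

    def test_invariance_to_name_substitution():
        # A perturbation that should NOT change the output: swapping a person's name.
        template = "I had a great time talking to {name} about the product."
        names = ["Alice", "Priya", "Mohammed", "Wei"]
        predictions = {predict_sentiment(template.format(name=n)) for n in names}
        assert len(predictions) == 1, "Prediction should not depend on the name"

    def test_directional_expectation_for_negation():
        # A perturbation that SHOULD change the output: negating the sentiment.
        assert predict_sentiment("The checkout flow is reliable.") == "positive"
        assert predict_sentiment("The checkout flow is not reliable.") == "negative"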

The tests above should be run in an automated way as part of a “continuous X” pipeline, but we may also want human-in-the-loop testing (“red teaming”) or even GenAI-in-the-loop testing.

We further discussed use-case-specific tests, with many examples, as well as tests of the overall model serving system, not just the model in isolation.
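
To make that last distinction concrete, here is a minimal sketch of a test against the serving system rather than the model in isolation; the endpoint URL, request schema, and 200 ms latency budget are hypothetical assumptions, not details from the lesson.

    # Minimal sketch: exercise the deployed prediction service end to end.
    # The URL, payload schema, and latency budget are hypothetical.
    import time
    import requests

    PREDICT_URL = "http://localhost:8000/predict"  # hypothetical endpoint

    def test_serving_system_returns_valid_prediction_quickly():
        payload = {"text": "example input"}
        start = time.perf_counter()
        response = requests.post(PREDICT_URL, json=payload, timeout=5)
        latency = time.perf_counter() - start

        assert response.status_code == 200
        assert "prediction" in response.json()  # response follows the agreed schema
        assert latency < 0.2                    # single-request latency budget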

For a model that passes “offline” tests, we may then proceed to an online evaluation, including:

  • shadow testing, where user requests are duplicated to both the old system and new system (responses from the new system are not shown to users!)
  • canary testing, where a small fraction of user requests are served by the new system (a routing sketch follows this list)
  • A/B testing, where user requests are routed to either the old or new system, and we compare them on business-specific metrics. As an example, we looked at how Target did an online evaluation of a new model for bundled product recommendations. Through A/B testing, they were able to directly attribute an increase in “click-based revenue” to the new model.
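
As a sketch of how requests might be split between the old and new systems for canary or A/B testing, here is a hypothetical hash-based router: each user is deterministically assigned to a variant based on a stable hash of their ID, so the same user always sees the same system. The function names, salt, and 5% canary fraction are illustrative assumptions.

    # Hypothetical sketch of deterministic traffic splitting for canary / A/B tests.
    import hashlib

    CANARY_FRACTION = 0.05  # e.g. 5% of users are served by the new system

    def bucket(user_id: str, salt: str = "canary-2024") -> float:
        """Map a user ID to a stable value in [0, 1]."""
        digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) / 0xFFFFFFFF

    def route(user_id: str) -> str:
        """Decide which system serves this user's request."""
        return "new_system" if bucket(user_id) < CANARY_FRACTION else "old_system"

    # For shadow testing, we would instead send every request to both systems and
    # log (but not return) the new system's responses for offline comparison.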

Once a model is deployed, we will be concerned with monitoring its predictions in production. We anticipate that over time, the model’s performance may degrade due to various types of drift, including covariate shift, label shift, and concept drift. When the model becomes “stale”, we will want to detect that its performance has degraded and re-train it on new data. Therefore, we must monitor the quality of its predictions.
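
One label-free way to catch covariate shift is to compare the distribution of incoming features (or of the model’s output scores) against a reference window from training time. Below is a minimal sketch using a two-sample Kolmogorov–Smirnov test from SciPy; the per-feature loop and the 0.01 significance threshold are illustrative choices, not part of the lesson.

    # Minimal sketch: flag covariate shift by comparing production feature values
    # against a reference sample from training, one feature at a time.
    import numpy as np
    from scipy.stats import ks_2samp

    def drifted_features(reference: np.ndarray, production: np.ndarray,
                         alpha: float = 0.01) -> list[int]:
        """Return indices of features whose distribution appears to have shifted.

        reference and production are 2-D arrays of shape (n_samples, n_features);
        the alpha threshold is an illustrative choice.
        """
        flagged = []
        for j in range(reference.shape[1]):
            statistic, p_value = ks_2samp(reference[:, j], production[:, j])
            if p_value < alpha:
                flagged.append(j)
        return flagged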

Monitoring prediction quality directly, however, is challenging, because in production we do not necessarily have ground truth labels for new data:

  • some use cases may have natural ground truth labels
  • sometimes the natural ground truth label arrives only after a delay; in the meantime, we may be able to use a proxy label, but it is not necessarily valid or helpful to optimize for
  • we may rely on user feedback, but it is often sparse and/or incorrect
  • for use cases where it is possible for a human to label new samples, we may get human labels for a random fraction of new samples, for samples where the model has low confidence, or for samples where the user provides feedback that the model was incorrect (see the sketch after this list)
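
For the last point, here is a minimal sketch of how production samples might be routed to human labelers: anything the user flagged as wrong, anything the model was unsure about, plus a small random fraction of all traffic. The 1% sampling rate and 0.6 confidence threshold are illustrative assumptions.

    # Hypothetical sketch: decide whether a production sample should be sent
    # for human labeling.
    import random

    RANDOM_LABEL_RATE = 0.01     # label ~1% of traffic at random (illustrative)
    CONFIDENCE_THRESHOLD = 0.6   # route low-confidence predictions (illustrative)

    def needs_human_label(confidence: float, user_flagged_incorrect: bool) -> bool:
        if user_flagged_incorrect:
            return True
        if confidence < CONFIDENCE_THRESHOLD:
            return True
        return random.random() < RANDOM_LABEL_RATE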

Finally, we discussed many other tests of the overall machine learning pipeline.

Lab assignment

Reading