Model serving

This week, we moved on to the next stage of the ML system lifecycle: model serving.

We said that a model serving service:

  • may be able to prepare batch predictions in advance of when they are needed, but only if the input data is available in advance. Even then, this approach can be wasteful if not all of the predictions end up being used.
  • or, it will be required to make online predictions in real time, while a user or autonomous system waits for the response.

and we further noted that the model serving service may live in the cloud (in which case it is subject to network conditions) or on an edge device (in which case it must be small, and will have only limited compute power available).

In developing a model serving design, our goals are usually: low latency (online) or high throughput (batch), low cost, and accurate results. However, we will often have to make compromises in one area to gain an advantage in another.

With this in mind, we described both model-level optimizations and system-level optimizations we could use.

Model-level optimizations:

  • The choice of model or foundation model is likely to have the greatest impact on latency and throughput! A smaller model, or a model architecture specifically designed for speed (e.g. a MobileNet, a YOLO), is likely to be much faster than a very large model, but it may be less accurate.
  • We said that many of the remaining model-level optimizations rely on us first compiling our model into a graph (see the compilation sketch after this list). Then, it becomes possible to apply graph optimizations such as:
    • eliminating operations on constants
    • fusing operations
    • transforming primitives from a less-customized implementation to an implementation that is customized for the specific hardware on which the model will be executed
    • and optimizing these implementations considering factors such as the memory layout or memory access characteristics of the target device. (We looked at one example, from How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog).
  • We also considered quantization or reduced precision: either quantization-aware training or post-training quantization (which can be static or dynamic); a post-training quantization sketch also follows this list.
  • and finally, we mentioned two other model-level optimizations: pruning and knowledge distillation (also sketched after the list).
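
To make the graph-compilation idea concrete, here is a minimal sketch (not from the lecture) showing two common routes: torch.compile, which traces a PyTorch model into an optimized graph, and export to ONNX followed by ONNX Runtime's graph-level optimizations (constant folding, operator fusion, hardware-specific kernels). The toy model and input shape are placeholder assumptions.

```python
# Minimal sketch, assuming PyTorch 2.x and onnxruntime are installed.
# The toy model and input shape are placeholders, not from the lecture.
import torch
import onnxruntime as ort

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()
example = torch.randn(1, 128)

# Route 1: torch.compile traces the model into a graph and applies
# optimizations (e.g. operator fusion) for the chosen backend.
compiled = torch.compile(model)
with torch.no_grad():
    _ = compiled(example)

# Route 2: export to ONNX, then let ONNX Runtime apply graph-level
# optimizations such as constant folding and node fusion.
torch.onnx.export(model, example, "model.onnx")
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options=opts)
outputs = session.run(None, {session.get_inputs()[0].name: example.numpy()})
print(outputs[0].shape)   # (1, 10)
```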
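
As one concrete example of post-training quantization, the sketch below applies PyTorch's dynamic quantization to the linear layers of a toy model: weights are stored in int8 and activations are quantized on the fly at inference time. The model and the choice of layers to quantize are illustrative assumptions, not the only recipe discussed.

```python
# Minimal sketch of post-training dynamic quantization, assuming PyTorch.
import torch

float_model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Replace nn.Linear layers with int8-weight versions; activations are
# quantized dynamically at inference time (no calibration data needed).
quantized_model = torch.ao.quantization.quantize_dynamic(
    float_model,
    {torch.nn.Linear},        # which layer types to quantize
    dtype=torch.qint8,        # target weight precision
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)   # same interface as the float model
```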
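
And as a rough illustration of the last two techniques: magnitude-based pruning with torch.nn.utils.prune, plus a minimal knowledge-distillation loss in which a small "student" model is trained to match a larger "teacher" model's softened outputs. The 30% sparsity level, temperature, and loss weighting are arbitrary values chosen for the example.

```python
# Minimal sketches of pruning and knowledge distillation, assuming PyTorch.
import torch
import torch.nn.utils.prune as prune

# Pruning: zero out the 30% of weights with the smallest magnitude
# (30% is arbitrary here; in practice it is tuned against the accuracy budget).
layer = torch.nn.Linear(256, 256)
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")                       # make the pruning permanent
print((layer.weight == 0).float().mean().item())    # ~0.30

# Knowledge distillation: the student matches the teacher's softened output
# distribution, plus the usual hard-label loss. T and alpha are illustrative.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = torch.nn.functional.kl_div(
        torch.log_softmax(student_logits / T, dim=-1),
        torch.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = torch.nn.functional.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```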

System-level optimizations:

  • We mentioned warm start vs cold start, which is typically a cost vs latency tradeoff.
  • We discussed concurrent model execution, both of different models and of parallel instances of the same model (see the toy sketch after this list).
  • We talked about dynamic batching (also sketched after this list).
  • and, ensembling models that perform a task together.
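
For the system-level side, here is a toy sketch of concurrent model execution: several instances of the same model served from a thread pool, so requests do not queue behind a single copy. Real serving systems (e.g. Triton Inference Server) manage instance counts and scheduling for you; the instance count, model, and shapes here are placeholder assumptions.

```python
# Toy sketch of parallel instances of the same model, assuming PyTorch;
# instance count, model, and shapes are placeholder values.
from concurrent.futures import ThreadPoolExecutor
import torch

N_INSTANCES = 2   # number of parallel copies of the same model

def make_model():
    return torch.nn.Sequential(torch.nn.Linear(128, 10)).eval()

# One model copy per slot, so instances do not share mutable state.
instances = [make_model() for _ in range(N_INSTANCES)]
pool = ThreadPoolExecutor(max_workers=N_INSTANCES)

def predict(instance_id, x):
    with torch.no_grad():
        return instances[instance_id](x)

# Dispatch several requests; they are executed on different instances.
requests = [torch.randn(1, 128) for _ in range(4)]
futures = [pool.submit(predict, i % N_INSTANCES, x) for i, x in enumerate(requests)]
print([tuple(f.result().shape) for f in futures])   # [(1, 10), (1, 10), ...]
```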
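
And a toy sketch of dynamic batching: requests are queued as they arrive, and a background thread waits briefly (up to a maximum delay, or until a maximum batch size is reached) before running one batched forward pass. The queue-based design, the 10 ms delay, and the batch size of 8 are illustrative assumptions, not a production recipe.

```python
# Toy sketch of dynamic batching, assuming PyTorch. A background thread
# collects queued requests for up to `max_wait` seconds (or until `max_batch`
# requests arrive) and runs one batched forward pass.
import queue
import threading
import time
import torch

model = torch.nn.Sequential(torch.nn.Linear(128, 10)).eval()
request_queue = queue.Queue()

def batching_loop(max_batch=8, max_wait=0.01):
    while True:
        items = [request_queue.get()]            # block until one request arrives
        deadline = time.monotonic() + max_wait
        while len(items) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                items.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        batch = torch.stack([x for x, _ in items])    # one batched input tensor
        with torch.no_grad():
            outputs = model(batch)
        for (_, slot), out in zip(items, outputs):
            slot["output"] = out
            slot["done"].set()                        # wake up the waiting caller

threading.Thread(target=batching_loop, daemon=True).start()

def predict(x):
    slot = {"done": threading.Event()}
    request_queue.put((x, slot))
    slot["done"].wait()
    return slot["output"]

print(predict(torch.randn(128)).shape)   # torch.Size([10])
```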

Finally, we shifted gears and used LyftLearn Serving as an example to talk about what an organization would want out of a model serving platform:

  • support many different ML frameworks,
  • isolate different models owned by different teams, so that they do not affect one another,
  • and make it easy for ML teams to create a deployment that will perform well for their particular model and use case, e.g. with customized templates.

Lab assignment

Reading