Data systems

In this week’s lesson, we noted that of all the previously identified “core organization capabilities” required to support large-scale machine learning, we have discussed all but the following two:

  • Data processing: the capability for data transformations and feature engineering, on both structured and unstructured data, in batch and stream mode (see the sketch after this list).
  • Data/feature store: the capability to share, discover, and reuse data and data pipelines
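
For example, the same feature engineering step can run as a batch transform over a whole table, or as a per-record transform on a stream. A minimal sketch (hypothetical column names; pandas for the batch case):

    import pandas as pd

    # Batch mode: compute an engineered feature over an entire table at once.
    df = pd.read_csv("rides.csv")                  # hypothetical structured data
    mean, std = df["fare"].mean(), df["fare"].std()
    df["fare_scaled"] = (df["fare"] - mean) / std

    # Stream mode: the same kind of transform applied to one event at a time,
    # reusing statistics computed offline in the batch step above.
    def scale_fare(event: dict) -> dict:
        event["fare_scaled"] = (event["fare"] - mean) / std
        return event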

We discussed different types of data repositories:

  • Relational database: good for structured data with a pre-defined schema, when we need all the CRUD operations
  • Data warehouse: good for structured data with a pre-defined schema, when we will mostly read data
  • Document database: for structured or semi-structured data with a more flexible schema
  • Columnar database: good for structured data with a pre-defined schema, optimized for operations over columns (e.g. “get the average value of this column”; see the example after this list)
  • Data lake: appropriate for unstructured data without a schema; it is most effective with a metadata layer on top
  • Data lakehouse: has a management layer on top of a data lake, to provide additional functionality
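
To make the columnar case concrete: a column-oriented engine can answer an aggregate query by scanning only the column it needs. A minimal sketch, assuming DuckDB and a hypothetical Parquet file of ride records:

    import duckdb

    # The engine reads only the "fare" column of the Parquet file to answer
    # this aggregate, rather than scanning every row in full.
    avg_fare = duckdb.sql(
        "SELECT AVG(fare) AS avg_fare FROM 'rides.parquet'"
    ).fetchone()[0]
    print(avg_fare)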

We also described the typical process of getting data into a data repository, with an ETL (Extract, Transform, Load) pipeline:

  • First, we extract data from one or more sources into a staging area (e.g. a local filesystem)
  • Then, we transform the data as required (organize, clean, compute offline engineered features)
  • Finally, we load the transformed data into a data repository.
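
As an illustration, a minimal ETL script might look like the sketch below (hypothetical source URL, file paths, and database connection string; a real pipeline would also handle errors and incremental loads):

    import os
    import pandas as pd
    from sqlalchemy import create_engine

    # Extract: pull raw data from a source into a staging area (local filesystem).
    os.makedirs("/tmp/staging", exist_ok=True)
    raw = pd.read_csv("https://example.com/raw/rides.csv")   # hypothetical source
    raw.to_csv("/tmp/staging/rides.csv", index=False)

    # Transform: organize and clean the data, and compute an offline engineered feature.
    df = pd.read_csv("/tmp/staging/rides.csv")
    df = df.dropna(subset=["pickup_time", "dropoff_time"])
    df["trip_minutes"] = (
        pd.to_datetime(df["dropoff_time"]) - pd.to_datetime(df["pickup_time"])
    ).dt.total_seconds() / 60

    # Load: write the transformed table into a data repository (here, a relational database).
    engine = create_engine("postgresql://user:password@db-host/rides")  # hypothetical connection
    df.to_sql("rides_clean", engine, if_exists="replace", index=False)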

We may use a workflow orchestrator (e.g. Airflow) to manage these pipelines.
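
For instance, with Airflow’s TaskFlow API (assuming Airflow 2.x), the ETL steps above could be organized roughly as follows; the task bodies and schedule here are placeholders:

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def etl_pipeline():

        @task
        def extract() -> str:
            # Pull raw data from the source into a staging area; return the staging path.
            return "/tmp/staging/rides.csv"

        @task
        def transform(staging_path: str) -> str:
            # Clean the staged data and compute offline engineered features.
            return "/tmp/staging/rides_clean.csv"

        @task
        def load(clean_path: str) -> None:
            # Write the transformed data into the data repository.
            pass

        # Declare the dependencies: extract, then transform, then load.
        load(transform(extract()))

    etl_pipeline()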

For online (streaming) data, we won’t necessarily think about data as “living” in some sort of data repository. Instead, streaming data exists as it moves through a streaming data system. To work with streaming data, we will often have a broker that receives data generated by producers and then makes it (reliably) available to one or more consumers.
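
As a concrete example, a broker such as Kafka decouples producers from consumers: producers publish events to a topic, and consumers read from that topic independently. A minimal sketch with the kafka-python client (hypothetical topic name and broker address):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: generates events and sends them to the broker.
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",                          # hypothetical broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("ride-events", {"ride_id": 123, "fare": 12.5})
    producer.flush()

    # Consumer: reads the same events from the broker, independently of the producer.
    consumer = KafkaConsumer(
        "ride-events",
        bootstrap_servers="broker:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)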

A feature store may integrate all of this functionality for ML systems, taking both batch and streaming data sources and making them more readily discoverable and available for model training and inference.
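
For example, with a feature store such as Feast (one possible choice; the feature view, feature names, and entity below are hypothetical), training code retrieves historical feature values while serving code fetches the freshest values online:

    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")   # assumes a Feast feature repository in this directory

    # Hypothetical labeled entities: ride IDs with event timestamps for point-in-time joins.
    labeled_rides = pd.DataFrame(
        {"ride_id": [123, 456], "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"])}
    )

    # For training: retrieve historical feature values joined to the labeled entities.
    training_df = store.get_historical_features(
        entity_df=labeled_rides,
        features=["ride_stats:trip_minutes", "ride_stats:fare_scaled"],
    ).to_df()

    # For inference: fetch the latest feature values for one entity from the online store.
    online_features = store.get_online_features(
        features=["ride_stats:trip_minutes", "ride_stats:fare_scaled"],
        entity_rows=[{"ride_id": 123}],
    ).to_dict()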

Slides: Data systems

Lab assignment

Due TBD

This lab assignment is in three parts:

  • Lab: Part 1: Persistent storage on Chameleon
  • Lab: Part 2: Batch data pipelines
  • Lab: Part 3: Online data pipelines

The first part is released in “Preview” mode until the remaining parts are also released; the due date will be set and the Gradescope submission will open when all parts are ready.

Resource usage notes for this lab assignment:

  • You will do this lab assignment on KVM@TACC, which does not require a reservation.
  • You may do the parts in any order, but you should only work on one part at a time: you must delete the resources from a previous part before starting another part.
  • Your resources should not be kept “active” for more than eight daytime (8AM - 11:59PM) hours; otherwise, they may be deleted by course staff. This includes data resources (block storage volumes, object storage containers).