Data systems

In this week’s lesson, we noted that of all the previously identified “core organization capabilities” required to support large-scale machine learning, we have discussed all but the following two:

  • Data processing: the capability for data transformations and feature engineering, on both structured and unstructured data, in batch and stream mode (see the sketch after this list).
  • Data/feature store: the capability to share, discover, and reuse data and data pipelines
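
For example, the same feature engineering step can run as a batch transform over a whole table, or as a per-record transform on a stream. A minimal sketch (hypothetical column names; pandas for the batch case):

    import pandas as pd

    # Batch mode: compute an engineered feature over an entire table at once.
    df = pd.read_csv("rides.csv")                  # hypothetical structured data
    mean, std = df["fare"].mean(), df["fare"].std()
    df["fare_scaled"] = (df["fare"] - mean) / std

    # Stream mode: the same kind of transform applied to one event at a time,
    # reusing statistics computed offline in the batch step above.
    def scale_fare(event: dict) -> dict:
        event["fare_scaled"] = (event["fare"] - mean) / std
        return event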

We discussed different types of data repositories:

  • Relational database: good for structured data with a pre-defined schema, when we need all the CRUD operations
  • Data warehouse: good for structured data with a pre-defined schema, when we will mostly read data
  • Document database: for structured or semi-structured data with a more flexible schema
  • Columnar database: good for structured data with a pre-defined schema, optimized for operations over columns (e.g. “get the average value of this column”; see the example after this list)
  • Data lake: appropriate for unstructured data without a schema; it is most effective with a metadata layer on top
  • Data lakehouse: has a management layer on top of a data lake, to provide additional functionality
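
To make the columnar case concrete: a column-oriented engine can answer an aggregate query by scanning only the column it needs. A minimal sketch, assuming DuckDB and a hypothetical Parquet file of ride records:

    import duckdb

    # The engine reads only the "fare" column of the Parquet file to answer
    # this aggregate, rather than scanning every row in full.
    avg_fare = duckdb.sql(
        "SELECT AVG(fare) AS avg_fare FROM 'rides.parquet'"
    ).fetchone()[0]
    print(avg_fare)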

We also described the typical process of getting data into a data repository, with an ETL (Extract, Transform, Load) pipeline:

  • First, we extract data from one or more sources into a staging area (e.g. a local filesystem)
  • Then, we transform the data as required (organize, clean, compute offline engineered features)
  • Finally, we load the transformed data into a data repository.
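
As an illustration, a minimal ETL script might look like the sketch below (hypothetical source URL, file paths, and database connection string; a real pipeline would also handle errors and incremental loads):

    import os
    import pandas as pd
    from sqlalchemy import create_engine

    # Extract: pull raw data from a source into a staging area (local filesystem).
    os.makedirs("/tmp/staging", exist_ok=True)
    raw = pd.read_csv("https://example.com/raw/rides.csv")   # hypothetical source
    raw.to_csv("/tmp/staging/rides.csv", index=False)

    # Transform: organize and clean the data, and compute an offline engineered feature.
    df = pd.read_csv("/tmp/staging/rides.csv")
    df = df.dropna(subset=["pickup_time", "dropoff_time"])
    df["trip_minutes"] = (
        pd.to_datetime(df["dropoff_time"]) - pd.to_datetime(df["pickup_time"])
    ).dt.total_seconds() / 60

    # Load: write the transformed table into a data repository (here, a relational database).
    engine = create_engine("postgresql://user:password@db-host/rides")  # hypothetical connection
    df.to_sql("rides_clean", engine, if_exists="replace", index=False)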

We may use a workflow orchestrator (e.g. Airflow) to manage these pipelines.
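
For instance, with Airflow’s TaskFlow API (assuming Airflow 2.x), the ETL steps above could be organized roughly as follows; the task bodies and schedule here are placeholders:

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def etl_pipeline():

        @task
        def extract() -> str:
            # Pull raw data from the source into a staging area; return the staging path.
            return "/tmp/staging/rides.csv"

        @task
        def transform(staging_path: str) -> str:
            # Clean the staged data and compute offline engineered features.
            return "/tmp/staging/rides_clean.csv"

        @task
        def load(clean_path: str) -> None:
            # Write the transformed data into the data repository.
            pass

        # Declare the dependencies: extract, then transform, then load.
        load(transform(extract()))

    etl_pipeline()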

For online (streaming) data, we won’t necessarily think about data as “living” in some sort of data repository. Instead, streaming data exists as it moves through a streaming data system. To work with streaming data, we will often have a broker that receives data generated by producers and then makes it (reliably) available to one or more consumers.
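
As a concrete example, a broker such as Kafka decouples producers from consumers: producers publish events to a topic, and consumers read from that topic independently. A minimal sketch with the kafka-python client (hypothetical topic name and broker address):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: generates events and sends them to the broker.
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",                          # hypothetical broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("ride-events", {"ride_id": 123, "fare": 12.5})
    producer.flush()

    # Consumer: reads the same events from the broker, independently of the producer.
    consumer = KafkaConsumer(
        "ride-events",
        bootstrap_servers="broker:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)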

A feature store may integrate all of this functionality for ML systems, taking both batch and streaming data sources and making them more readily discoverable and available for model training and inference.
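
For example, with a feature store such as Feast (one possible choice; the feature view, feature names, and entity below are hypothetical), training code retrieves historical feature values while serving code fetches the freshest values online:

    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")   # assumes a Feast feature repository in this directory

    # Hypothetical labeled entities: ride IDs with event timestamps for point-in-time joins.
    labeled_rides = pd.DataFrame(
        {"ride_id": [123, 456], "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"])}
    )

    # For training: retrieve historical feature values joined to the labeled entities.
    training_df = store.get_historical_features(
        entity_df=labeled_rides,
        features=["ride_stats:trip_minutes", "ride_stats:fare_scaled"],
    ).to_df()

    # For inference: fetch the latest feature values for one entity from the online store.
    online_features = store.get_online_features(
        features=["ride_stats:trip_minutes", "ride_stats:fare_scaled"],
        entity_rows=[{"ride_id": 123}],
    ).to_dict()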

Slides: Data systems

Lab assignment

Due TBD

This lab assignment is in three parts:

  • Lab: Part 1: Persistent storage on Chameleon
  • Lab: Part 2: Batch data pipelines
  • Lab: Part 3: Online data pipelines

The first part is released in “Preview” mode until the remaining parts are also released; the due date will be set and the Gradescope submission will open when all parts are ready.

Resource usage notes for this lab assignment:

  • You will do this lab assignment on KVM@TACC, which does not require a reservation.
  • You may do the parts in any order, but you should only work on one part at a time: you must delete the resources from a previous part before starting another part.
  • Your resources should not be kept “active” for more than eight daytime (8AM - 11:59PM) hours; otherwise, they may be deleted by course staff. This includes data resources (block storage volumes, object storage containers).