Data and Model Repository

The data science team uses Looker to explore the data and select the images they want to use to build a dataset for a specific use case or project. Next, the user creates a filter to retrieve the set of UUIDs of the selected images, along with their metadata.

This set of selected images represents a dataset that we want to version and use across different projects, in which we will run different ML experiments.

Since we need to version multiple datasets and models, we build a data registry. A data registry is a kind of data management middleware between ML projects and cloud storage.

Here are its advantages [1]:

  • Reusability: reproduce and organize feature stores with a simple CLI (dvc get and dvc import commands, similar to software package management systems like pip).

  • Persistence: the DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. For example, there are fewer chances someone can delete or rewrite a model.

  • Storage optimization: track data shared by multiple projects centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.

  • Data as code: leverage Git workflow such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle.

  • Security: registries can be set up to have read-only remote storage (e.g. an HTTP location).

In a nutshell, we can build a DVC project dedicated to tracking and versioning datasets and models. The repository would have all the metadata and history of changes in the different datasets and models. We can see who updated what and when and use pull requests to update data, the same way we do with code.
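For example, a downstream ML project can fetch a versioned dataset from the registry with dvc get (a plain download) or dvc import (which also records the dependency in a .dvc file, so the dataset's origin and version stay tracked). The repository URL below is illustrative:

$ dvc get https://github.com/example/dataset-registry dataset_1/prepared
$ dvc import https://github.com/example/dataset-registry dataset_1/prepared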

Regarding the datasets, here's a possible folder structure for the data registry:

dataset-registry/
├── dataset_1/           <- A folder for each dataset related to a context or a project.
│   ├── prepared/        <- The final, canonical datasets for modeling.
│   │   ├── train/       <- Images used for training a model.
│   │   ├── val/         <- Images used for validating a model.
│   │   └── test/        <- Images used for testing a model.
│   └── raw/             <- The original, immutable data dump: the file with the
│                           list of UUIDs retrieved from Looker.
├── dataset_2/
│   ├── prepared/
│   └── raw/
├── ...
└── dataset_N/
    ├── prepared/
    └── raw/

We initialize Git and DVC, making sure we're positioned in the top-level folder of the repository:
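$ git init
$ dvc init

We also need to set up remote storage for the data files controlled by DVC; in our case this is an S3 bucket, added with the -d flag so it becomes the default remote: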

$ dvc remote add -d remote_storage s3://path/to/dvc_remote

Now we can (as sketched below):

  1. Track files

  2. Upload files

  3. Download files
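These three steps map onto standard DVC commands. A minimal sketch, assuming dataset_1 already contains the data we want to track:

$ dvc add dataset_1        # track: hash the data and create dataset_1.dvc
$ git add dataset_1.dvc .gitignore
$ git commit -m "Track dataset_1 with DVC"
$ dvc push                 # upload the tracked files to the S3 remote
$ dvc pull                 # download them again, e.g. in a fresh clone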

DVC now knows where to back up the datasets and the models.
