Data and Model Repository
The data science team uses Looker to explore the data and select the images that they want to use to build a dataset for a specific use case or project. Next, the user creates a filter to get the set of UUIDs of the selected images, along with their metadata.
This set of selected images represents a dataset that we want to version and use in different projects, in which we will run different ML experiments.
Since we need to version multiple datasets and models, we build a data registry. A data registry is a kind of data management middleware between ML projects and cloud storage.
Here are its advantages [1]:
Reusability: reproduce and organize feature stores with a simple CLI (dvc get and dvc import commands, similar to software package management systems like pip).
Persistence: the DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. For example, there are fewer chances someone can delete or rewrite a model.
Storage optimization: track data shared by multiple projects centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.
Data as code: leverage Git workflow such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle.
Security: registries can be set up to have read-only remote storage (e.g. an HTTP location).
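As a rough sketch of the reuse workflow mentioned in the list above, the two commands could be wrapped as shown below. The registry URL and dataset path are hypothetical placeholders, not values from this project, and the functions are only defined here, not executed.

```shell
# Hypothetical values: replace with your own registry repository and dataset path.
REGISTRY_URL="https://github.com/example-org/data-registry"
DATASET_PATH="datasets/use-case-a"

# Download a plain copy of the dataset into the current directory.
# Nothing about its origin is recorded.
fetch_dataset() {
    dvc get "$REGISTRY_URL" "$DATASET_PATH"
}

# Download the dataset into the current DVC project and write a .dvc file
# that records the source repository, so `dvc update` can refresh it later.
import_dataset() {
    dvc import "$REGISTRY_URL" "$DATASET_PATH"
}
```

Calling fetch_dataset suits one-off downloads, while import_dataset is the better fit when a project should stay linked to the registry version it was built from.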
In a nutshell, we can build a DVC project dedicated to tracking and versioning datasets and models. The repository would have all the metadata and history of changes in the different datasets and models. We can see who updated what and when and use pull requests to update data, the same way we do with code.
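Because each dataset and model is represented in Git by a small .dvc file, the change history can be inspected with ordinary Git commands. A minimal sketch, assuming the registry is a Git repository and that the path passed in points to a committed .dvc file (the function name is hypothetical):

```shell
# Print who changed a tracked dataset or model, and when.
# $1 is the path to a .dvc file committed in the registry repository.
dataset_history() {
    # One line per commit: short hash, date, author, subject.
    git log --follow --date=short --format='%h %ad %an %s' -- "$1"
}
```

For example, dataset_history datasets/use-case-a/images.dvc (a hypothetical path) would list every commit that touched that dataset's pointer file.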
Regarding the datasets, here is a possible folder structure for the data registry:
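One hypothetical layout, given purely as an illustration, keeps one folder per use case with images and metadata side by side. The folder names below are assumptions, not the project's actual structure.

```shell
# Hypothetical layout: one folder per use case, with images and metadata kept apart.
REGISTRY_ROOT="${REGISTRY_ROOT:-data-registry}"

for use_case in use-case-a use-case-b; do
    mkdir -p "$REGISTRY_ROOT/datasets/$use_case/images"
    mkdir -p "$REGISTRY_ROOT/datasets/$use_case/metadata"
done

# Show the resulting directory tree.
find "$REGISTRY_ROOT" -type d | sort
```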
We initialize Git and DVC from the top-level folder of the repository. We also need to configure remote storage (in our case, S3) for the data files controlled by DVC:
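The setup could look roughly like the sketch below. The remote name and bucket path are hypothetical placeholders, and the steps are wrapped in a function so that nothing runs until it is called explicitly.

```shell
# Hypothetical remote name and bucket path: replace with your own.
REMOTE_NAME="storage"
REMOTE_URL="s3://example-bucket/data-registry"

# Run from the top-level folder of the repository.
init_registry() {
    git init &&                                        # start the Git repository
    dvc init &&                                        # create .dvc/ and .dvcignore
    dvc remote add -d "$REMOTE_NAME" "$REMOTE_URL" &&  # -d marks this remote as the default
    git add .dvc .dvcignore &&                         # include the updated .dvc/config
    git commit -m "Initialize DVC with S3 remote"
}
```

Using an S3 remote assumes that DVC was installed with S3 support (for example, pip install "dvc[s3]") and that valid AWS credentials are available in the environment.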
Now we can:
Track files
Upload files
Download files
DVC now knows where to back up the datasets and the models.
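The three operations listed above map onto DVC commands roughly as sketched here. The function names are hypothetical, and the sketch assumes an initialized Git and DVC repository with a default remote already configured.

```shell
# Track files: let DVC manage the data and commit the small pointer file to Git.
# $1 is the file or directory to track (a trailing slash is stripped).
track_dataset() {
    target="${1%/}"
    dvc add "$target" &&
    # dvc add writes <target>.dvc and a .gitignore entry next to the target.
    git add "$target.dvc" "$(dirname "$target")/.gitignore" &&
    git commit -m "Track $target with DVC"
}

# Upload files: send the tracked data to the default remote.
upload_data() {
    dvc push
}

# Download files: fetch the data referenced by the .dvc files in this checkout.
download_data() {
    dvc pull
}
```

A typical sequence would be track_dataset datasets/use-case-a/images (a hypothetical path), then upload_data, and later download_data on another machine after cloning the repository.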