Architecture

We need to track and version the datasets and models used in each experiment, and the team also wants to filter the images by various criteria when building those datasets. To address these needs, we can combine data-versioning software such as DVC, which versions the data and models, with Looker, which lets us explore and filter the images.

All the original images are stored in S3, addressed by a universally unique identifier (UUID). For example: s3://host-name/ab208998-17b3-4e67-a6b2-c7cfea89629a/original. A media-metadata store maps metadata to these UUIDs. Looker queries this store with filters (e.g. "media UUIDs of French documents uploaded within the last week from these clients, where the image quality score was below a certain threshold") to produce a set of UUIDs.
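As a minimal sketch of the addressing scheme, the S3 URI for a media item's original image can be derived directly from its UUID (the bucket name `host-name` and the `original` suffix follow the example path above; substitute your own bucket):

```shell
# Build the S3 URI for a media item's original image from its UUID.
# "host-name" mirrors the example path above; adjust for your bucket.
media_uuid="ab208998-17b3-4e67-a6b2-c7cfea89629a"
uri="s3://host-name/${media_uuid}/original"
echo "$uri"
```

Given a set of UUIDs returned by Looker, the same pattern yields the full list of S3 objects to pull into a dataset.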

To operationalize the ML workflow, we will combine Git, DVC, and an Elasticsearch-based catalog such as Quilt Catalog, which provides an interface on top of S3 to browse, search, and delete media from datasets.

DVC is a command-line tool written in Python that mimics Git commands and workflows, and it is meant to be run alongside Git: the git and dvc commands are often used in tandem, one after the other. While Git stores and versions code, DVC does the same for data and model files. DVC uses a remote repository to store all data and models, and it supports S3, so we can easily set up a remote repository on S3 to store them.
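A typical git/dvc tandem workflow might look like the following sketch. The remote name `s3remote`, the bucket path, and the `data/train_images` directory are placeholders, not values from this document:

```shell
# One-time setup: initialize DVC inside an existing Git repository
# and point its default remote at an S3 bucket (placeholder path).
dvc init
dvc remote add -d s3remote s3://host-name/dvc-storage
git commit -m "Initialize DVC"

# Per-dataset workflow: DVC tracks the data, Git tracks the pointer.
dvc add data/train_images          # writes data/train_images.dvc
git add data/train_images.dvc data/.gitignore
git commit -m "Track training images with DVC"

# Upload the data to the S3 remote; push the pointer files to GitHub.
dvc push
git push
```

Checking out an earlier Git commit and running `dvc pull` then restores the exact dataset version that commit recorded.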

When we store data and models in S3, a .dvc file is created. A .dvc file is a small text file that points to the actual data files in remote storage. Because it is lightweight, we can store it alongside the code in GitHub; when we clone the Git repository, we also get the .dvc files. Large data and model files go into DVC remote storage, while the small .dvc files that point to them go into GitHub.
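For illustration, a .dvc pointer file is only a few lines of YAML; the path, hash, and sizes below are hypothetical:

```yaml
# data/train_images.dvc -- hypothetical pointer file tracked in Git.
# The md5 value identifies the content stored in the DVC remote on S3;
# the data itself never enters the Git repository.
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 104857600
  nfiles: 1250
  path: train_images
```

Because the pointer is plain text, Git diffs show exactly when a dataset's contents changed, even though the binaries live in S3.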
