Exploring and Managing Datasets and Experiments with a Catalog

The team must be able to explore the datasets, see who updated what and when, compare experiments, delete media from all datasets at a client's request, and update the models accordingly. All of that information lives in the data repository's metadata and change history.

To support this, we could implement an Elasticsearch-based Catalog, similar to the Quilt Catalog but with more features, that provides a user interface on top of the data registry on S3 for browsing, searching, comparing, and deleting datasets, models, or media within a dataset.

We can point this Catalog at the already implemented data registry, allowing team members to run routines to:

  • Browse, search and delete datasets.

  • Browse, search and delete models.

  • Delete the media from all datasets automatically.

  • See who updated what and when.

  • Show metrics and configurations to compare experiments quickly.
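As an illustration, the browse/search routines could translate user input into Elasticsearch queries over the indexed registry metadata. The sketch below only builds the query body; the index and field names (`name`, `updated_by`, `updated_at`) are assumptions that would depend on how the metadata is actually indexed:

```python
# Sketch of a Catalog search helper. Field names ("name", "updated_by",
# "updated_at") are hypothetical and depend on the real metadata schema.
def build_dataset_query(text=None, updated_by=None, since=None):
    """Compose an Elasticsearch bool query to browse/search datasets."""
    must = []
    if text:
        must.append({"match": {"name": text}})
    if updated_by:
        must.append({"term": {"updated_by": updated_by}})
    if since:
        must.append({"range": {"updated_at": {"gte": since}}})
    # With no filters, fall back to browsing everything.
    return {"query": {"bool": {"must": must or [{"match_all": {}}]}}}

# Example: "see who updated what and when" for a given team member.
query = build_dataset_query(updated_by="alice", since="2023-01-01")
```

The same pattern extends to models and media: one index per artifact type, with the query builder shared across routines.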

Using the DVC Python API, we can run those routines on demand. For instance, when a user calls the deletion routine, the web app should trigger a commit and a push that update the versions of the impacted datasets, for example:

$ git commit -m "Delete 20,000 images in dataset_1/raw/"
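A minimal sketch of that deletion routine is shown below. It shells out to the `dvc` and `git` CLIs (the DVC Python API exposes read access rather than deletion), and the dataset path and image count are illustrative only:

```python
import subprocess


def deletion_commands(dataset_path, n_deleted):
    """Build the dvc/git command sequence that records a media deletion.

    A sketch: it assumes the media files have already been removed from
    the workspace, then re-tracks the pruned directory and commits the
    updated .dvc pointer file. Arguments are illustrative.
    """
    return [
        ["dvc", "add", dataset_path],           # re-track the pruned directory
        ["git", "add", dataset_path + ".dvc"],  # stage the updated pointer file
        ["git", "commit", "-m",
         f"Delete {n_deleted:,} images in {dataset_path}/"],
        ["dvc", "push"],                        # upload to the S3 remote
        ["git", "push"],                        # publish the new version
    ]


def run_deletion(dataset_path, n_deleted):
    """Execute the sequence; requires dvc and git on the host."""
    for cmd in deletion_commands(dataset_path, n_deleted):
        subprocess.run(cmd, check=True)
```

Keeping the command sequence in a pure function makes the routine easy to log and audit before anything is actually pushed.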

We then rerun the full ML pipeline, tag a new version as already described, and mark the previous model as deprecated, since we can no longer reproduce the same model. The trigger should be optional: if the deleted images impact several ML pipelines, rerunning everything could be cumbersome, and it would be overkill in a research context.
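The optional trigger can be a simple policy plus the rerun sequence. Both the policy thresholds and the command arguments below are assumptions, not part of the original design:

```python
def should_rerun(impacted_pipelines, context="production"):
    """Hypothetical policy: rerun automatically only when cheap.

    Skips the automatic rebuild in a research context, and batches the
    work manually when several pipelines are impacted.
    """
    if context == "research":
        return False  # overkill while experimenting
    return len(impacted_pipelines) <= 1


def rerun_commands(version_tag):
    """dvc/git commands that reproduce the pipeline and tag the new version."""
    return [
        ["dvc", "repro"],  # rerun the full ML pipeline
        ["git", "commit", "-am", f"Rebuild after media deletion ({version_tag})"],
        ["git", "tag", "-a", version_tag, "-m", "previous model deprecated"],
    ]
```

When `should_rerun` returns `False`, the Catalog can simply queue the impacted pipelines for a manual, batched rebuild later.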
