ML Projects

Each team member can get a dataset or a model from the data registry, or push one to it, and add it to their current project:

$ dvc get https://github.com/example/dataset-registry \
          dataset_1/raw/

This downloads the dataset_1/raw/ directory from the remote data registry and places it in the current working directory.

Here's a possible folder structure for the ML project's repository (we can take inspiration from [4]):

project/
├── dataset_1/           <- A folder for each dataset related to a context or a project.
│   ├── prepared/        <- The final, canonical data sets for modelling.
│   │   ├── train/       <- Includes images used for training a model.
│   │   ├── val/         <- Includes images used for validating a model.
│   │   └── test/        <- Includes images used for testing a model.
│   └── raw/             <- The original, immutable data dump.
│       └── raw_uuid.csv <- The list of UUIDs obtained from Looker.
├── requirements.txt     <- The requirements file for reproducing the analysis environment.
├── metrics/
├── model/
├── config/
│   ├── prepare-config.json
│   └── model-config.json
└── src/
    ├── prepare.py
    ├── train.py
    └── evaluate.py

There are seven types of folders in that repository:

  1. src/ is for source code.

  2. dataset_1/prepared/train/ is for data prepared for training.

  3. dataset_1/prepared/val/ is for data prepared for validating.

  4. dataset_1/prepared/test/ is for data prepared for testing.

  5. model/ is for machine learning models.

  6. metrics/ is for tracking the models' performance metrics.

  7. config/ is for tracking the configuration of the pipeline.

The src/ folder contains three Python files:

  1. prepare.py contains code for preparing data for training.

  2. train.py contains code for training a machine learning model.

  3. evaluate.py contains code for evaluating the results of a machine learning model.

The config/ folder contains two JSON files:

  1. prepare-config.json contains the parameters to produce the dataset (custom filters, crop, scale, flip, rotate, etc.).

  2. model-config.json contains the parameters to produce the model (algorithm, hyper-parameters, etc.).
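
For illustration, model-config.json might contain something like the following (the field names and the algorithm are hypothetical; max_iter is the parameter we will tune later in this document):

{
    "algorithm": "SGDClassifier",
    "max_iter": 10
}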

The requirements.txt file allows reproducing the analysis environment (for instance in a Docker container); it can be generated with pip freeze > requirements.txt.

We initialize Git and DVC by running git init and dvc init from the top-level folder of the project repository.

In this case, we will follow the basic rule of thumb: small files go to GitHub, large files to DVC remote storage.

Training and Evaluating a Model

To train a model, the data scientist can use any method they like. To simplify, let's say they want to use supervised learning: the model is shown each image along with its label and learns the mapping between them.

To prepare the dataset that will be stored in dataset_1/prepared/, the researcher runs the prepare.py script, which has five main steps:

  1. Read the configuration.

  2. Download the original images from S3 into dataset_1/raw/, using the URLs listed in dataset_1/raw/raw_uuid.csv.

  3. Preprocess the images in dataset_1/raw.

  4. Split them into train, validation and test sets (this really depends on the researcher's needs).

  5. Save the processed images in dataset_1/prepared.
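
For concreteness, here is a minimal sketch of what prepare.py might look like, assuming it is run from the project root. The CSV columns (uuid, url), the config key (size) and the 70/15/15 split are illustrative assumptions, not part of the original workflow:

import csv
import json
import random
import urllib.request
from pathlib import Path

from PIL import Image  # pip install Pillow

RAW_DIR = Path("dataset_1/raw")
PREPARED_DIR = Path("dataset_1/prepared")

def main():
    # 1. Read the configuration (the "size" key is hypothetical).
    config = json.loads(Path("config/prepare-config.json").read_text())

    # 2. Download the original images from S3 into dataset_1/raw/,
    #    assuming raw_uuid.csv has "uuid" and "url" columns.
    with open(RAW_DIR / "raw_uuid.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        target = RAW_DIR / f"{row['uuid']}.jpg"
        if not target.exists():
            urllib.request.urlretrieve(row["url"], target)

    # 3. Preprocess the raw images (here: a simple resize).
    images = []
    for row in rows:
        img = Image.open(RAW_DIR / f"{row['uuid']}.jpg")
        images.append((row["uuid"], img.resize(tuple(config["size"]))))

    # 4. Split into train, validation and test sets (70/15/15 here).
    random.shuffle(images)
    n = len(images)
    splits = {
        "train": images[: int(0.7 * n)],
        "val": images[int(0.7 * n) : int(0.85 * n)],
        "test": images[int(0.85 * n) :],
    }

    # 5. Save the processed images into dataset_1/prepared/.
    for split, items in splits.items():
        out_dir = PREPARED_DIR / split
        out_dir.mkdir(parents=True, exist_ok=True)
        for uuid, img in items:
            img.save(out_dir / f"{uuid}.jpg")

if __name__ == "__main__":
    main()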

Then we commit and push this dataset version:

$ dvc add dataset_1/prepared/train
$ dvc add dataset_1/prepared/val
$ dvc add dataset_1/prepared/test
$ git add --all
$ git commit -m "crop selfie in dataset_1/prepared/"
$ dvc push
$ git push

Here's what DVC does under the hood:

  1. Adds the dataset_1/prepared/train/ folder to .gitignore.

  2. Adds the dataset_1/prepared/val/ folder to .gitignore.

  3. Adds the dataset_1/prepared/test/ folder to .gitignore.

  4. Creates one file with the .dvc extension per added folder: train.dvc, val.dvc and test.dvc.

  5. Copies the folders to the DVC cache (.dvc/cache by default), which acts as a staging area for remote storage.
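
For reference, each .dvc file is just a small text file pointing DVC at the cached data. A train.dvc file might look roughly like this (the hash is a placeholder, and the exact fields vary with the DVC version):

outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: train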

Let's assume that when the data scientist runs the train.py file, it executes four steps:

  1. Load the already preprocessed train and validation images into memory.

  2. Load the class labels into memory.

  3. Train a machine learning model to classify the images.

  4. Save the machine learning model to the local disk.
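
Here is an equally minimal sketch of what train.py might look like, again run from the project root. scikit-learn, the SGDClassifier algorithm, the per-split labels.csv files and the config keys are illustrative assumptions:

import csv
import json
from pathlib import Path

import joblib  # pip install joblib scikit-learn numpy Pillow
import numpy as np
from PIL import Image
from sklearn.linear_model import SGDClassifier

PREPARED_DIR = Path("dataset_1/prepared")

def load_split(split):
    # 1./2. Load the preprocessed images and their class labels into memory,
    #       assuming a hypothetical labels.csv (filename,label) in each split.
    with open(PREPARED_DIR / split / "labels.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    X = np.array([
        np.asarray(Image.open(PREPARED_DIR / split / r["filename"])).ravel()
        for r in rows
    ])
    y = np.array([r["label"] for r in rows])
    return X, y

def main():
    config = json.loads(Path("config/model-config.json").read_text())

    X_train, y_train = load_split("train")
    X_val, y_val = load_split("val")

    # 3. Train a machine learning model to classify the images.
    model = SGDClassifier(max_iter=config["max_iter"], random_state=0)
    model.fit(X_train, y_train)
    print("validation accuracy:", model.score(X_val, y_val))

    # 4. Save the trained model to the local disk.
    Path("model").mkdir(exist_ok=True)
    joblib.dump(model, "model/model.dat")

if __name__ == "__main__":
    main()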

When the script finishes, we will have a trained machine learning model saved in the model/ folder with the name model.dat. This is the most important file of the experiment. It needs to be added to DVC, with the corresponding .dvc file committed to GitHub:

$ dvc add model/model.dat
$ git add --all
$ git commit -m "Trained a <name of the algorithm> classifier"

To evaluate the model, let's assume we run the evaluate.py file and obtain metrics that are safely stored in the metrics.json file in the metrics/ folder. Then, whenever we change a hyper-parameter of the model or use a different pre-processing, we can see whether the model is improving by comparing the new results to these.

In this case, let's assume the JSON file contains only one object, the accuracy of the model:

{"accuracy": 0.670595690747782 }

Since the metrics JSON file is tiny, it's useful to keep it in GitHub so we can quickly check how well each experiment performed:

$ git add --all
$ git commit -m "Evaluate the <name of the algorithm> model accuracy"

Every time we run an experiment, we want to know precisely what inputs went into the system and what outputs were created.

First, we should push all the changes we've made to the first experiment to GitHub and DVC remote storage:

$ git push
$ dvc push

The code and metrics are now backed up on GitHub, and the data and model on DVC remote storage.

Tagging Commits to Mark a Model Version

A common practice is to use tagging to mark a specific point in your Git history as being important [5]. So, for example, if we've completed an experiment and produced a new model, we create a tag to signal to all the team members that we have a ready-to-go model:

$ git tag -a <name of the algorithm>-classifier -m "<name of the algorithm> with accuracy 67.06%"

Alternatively, the team is free to version models with version numbers, like v1.0, v1.3, and so on.

Git tags aren't pushed with regular commits, so they must be pushed separately to our repository's origin on GitHub. Use the --tags switch to push all tags from our local repository to the remote:

$ git push origin --tags

We can always have a look at all the tags, and thus the model versions, in the current repository:

$ git tag

Creating One Git Branch Per Experiment

We can create a new branch for every experiment. Let's say that in the first experiment, we set the maximum number of iterations of the model to 10. We can try increasing that number to 100 to see if it improves the result. Create a new branch and call it <name of the algorithm>-100-iterations:

$ git checkout -b "<name of the algorithm>-100-iterations"

When we create a new branch, all the .dvc files we had in the previous branch will be present in the new branch, just like other files and folders.

We can update model-config.json so that the model uses max_iter=100, then rerun the training and evaluation by running train.py and evaluate.py. We will get a new model.dat file and a new metrics.json file.

Since the training process has changed the model.dat file, we need to commit it to the DVC cache:

$ dvc commit

Remember, dvc commit works differently from git commit: it updates an already tracked file in the DVC cache. This won't delete the previous model from the cache but will store a new version alongside it.

Add and commit the changes we've made to Git:

$ git add --all
$ git commit -m "Change <name of the algorithm> max_iter to 100"
$ git push
$ dvc push

We can also jump between branches by checking out the code from GitHub and then checking out the data and model from DVC, as shown below.
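
Concretely, switching to another experiment takes two checkouts, one per tool:

$ git checkout <name of the branch>
$ dvc checkout

Here, dvc checkout synchronizes the workspace with the .dvc files of the branch we just checked out, restoring the matching data and model from the local cache.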

Now we have multiple experiments and their results versioned and stored, and we can access them by checking out the content via Git and DVC.
