ML Projects
Each team member can get a dataset or a model from the data registry, or push one to it, and add it to their current project:
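For example, with `dvc get` and a hypothetical registry repository URL:

```shell
# Download a dataset from the data registry (URL is hypothetical)
dvc get https://github.com/our-org/data-registry dataset_1/raw
```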
This downloads the `dataset_1/raw/` directory from the remote data registry and places it in the current working directory.
Here's a possible folder structure for the repository of the ML project (we can take inspiration from [4]):
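One possible layout, assembled from the folders and files described below:

```
.
├── config/
│   ├── prepare-config.json
│   └── model-config.json
├── dataset_1/
│   ├── raw/
│   │   └── raw_uuid.csv
│   └── prepared/
│       ├── train/
│       ├── val/
│       └── test/
├── metrics/
│   └── metrics.json
├── model/
│   └── model.dat
├── src/
│   ├── prepare.py
│   ├── train.py
│   └── evaluate.py
└── requirements.txt
```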
There are seven types of folders in that repository:

- `src/` is for source code.
- `dataset_1/prepared/train/` is for data prepared for training.
- `dataset_1/prepared/val/` is for data prepared for validation.
- `dataset_1/prepared/test/` is for data prepared for testing.
- `model/` is for machine learning models.
- `metrics/` is for tracking the models' performance metrics.
- `config/` is for tracking the configuration of the pipeline.
The `src/` folder contains three Python files:

- `prepare.py` contains code for preparing data for training.
- `train.py` contains code for training a machine learning model.
- `evaluate.py` contains code for evaluating the results of a machine learning model.
The `config/` folder contains two JSON files:

- `prepare-config.json` contains the parameters to produce the dataset (custom filters, crop, scale, flip, rotate, etc.).
- `model-config.json` contains the parameters to produce the model (algorithm, hyper-parameters, etc.).
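As an illustration, the two files could look like this (all keys and values here are hypothetical, not prescribed by the project):

```json
{
    "crop": [224, 224],
    "scale": 0.5,
    "flip": true,
    "rotate": 15
}
```

```json
{
    "algorithm": "sgd",
    "max_iter": 10
}
```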
The `requirements.txt` file is used for reproducing the analysis environment (for instance in a Docker container); it can be generated with `pip freeze > requirements.txt`.
We initialize Git and DVC, ensuring we're positioned in the top-level folder of the project repository.
In this case, we will follow the basic rule of thumb: send the small files to GitHub and the large files to DVC remote storage.
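Initialization could look like the following (the S3 bucket name is hypothetical):

```shell
git init
dvc init
# Configure a default remote for the large files (hypothetical bucket)
dvc remote add -d storage s3://our-bucket/dvc-storage
git commit -m "Initialize DVC"
```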
Training and Evaluating a Model
To train a model, the data scientist can use whatever method they prefer. To keep things simple, let's say they use supervised learning: the model is shown an image together with its label and learns from them.
To prepare the dataset that will be stored in `dataset_1/prepared`, the researcher runs the `prepare.py` script, which has five main steps:

1. Read the configuration.
2. Download the original images from S3 into `dataset_1/raw/`, using the URLs in `dataset_1/raw/raw_uuid.csv`.
3. Preprocess the images in `dataset_1/raw`.
4. Split them into train, validation, and test sets (this really depends on the researcher's needs).
5. Save the processed images in `dataset_1/prepared`.
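The script itself is not shown here, but the splitting step (step 4) could be sketched as follows; the fractions, the seed, and the file names are hypothetical:

```python
import random

def split_dataset(filenames, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the file list and split it into train/val/test subsets.

    The fractions and seed are hypothetical defaults, not values
    mandated by the project.
    """
    files = list(filenames)
    random.Random(seed).shuffle(files)  # deterministic shuffle for reproducibility
    n_train = int(len(files) * train_frac)
    n_val = int(len(files) * val_frac)
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return train, val, test

# Example with dummy file names
names = [f"img_{i:03d}.png" for i in range(100)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # prints: 70 15 15
```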
And commit and push this dataset version afterwards:
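A possible command sequence (the commit message is hypothetical):

```shell
dvc add dataset_1/prepared/train dataset_1/prepared/val dataset_1/prepared/test
git add dataset_1/prepared/*.dvc dataset_1/prepared/.gitignore
git commit -m "Add prepared dataset"
dvc push
```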
Here's what DVC does under the hood:

- Adds the `dataset_1/prepared/train/`, `dataset_1/prepared/val/`, and `dataset_1/prepared/test/` folders to `.gitignore`.
- Creates one file with the `.dvc` extension per folder: `train.dvc`, `val.dvc`, and `test.dvc`.
- Copies the folders to a staging area.
Let's assume that the `train.py` script executes four steps:

1. Load the already preprocessed train and validation images into memory.
2. Load the class labels into memory.
3. Train a machine learning model to classify the images.
4. Save the machine learning model to the local disk.
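The actual model depends on the project; purely as an illustration of the four steps, here is a toy `train.py`-style sketch that trains a nearest-centroid classifier on in-memory feature vectors and saves it with `pickle` (all data, labels, and defaults are hypothetical):

```python
import os
import pickle

def train_centroids(features, labels):
    """Compute one mean feature vector (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

# Hypothetical, already preprocessed training data and labels (steps 1 and 2)
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
y = ["cat", "cat", "dog", "dog"]

model = train_centroids(X, y)             # step 3: train
os.makedirs("model", exist_ok=True)
with open("model/model.dat", "wb") as f:  # step 4: save to local disk
    pickle.dump(model, f)

print(predict(model, [0.05, 0.0]))  # prints: cat
```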
When the script finishes, we will have a trained machine learning model saved in the `model/` folder with the name `model.dat`. This is the most important file of the experiment. It needs to be added to DVC, with the corresponding `.dvc` file committed to GitHub:
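For instance (the commit message is hypothetical):

```shell
dvc add model/model.dat
git add model/model.dat.dvc model/.gitignore
git commit -m "Add trained model"
```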
To evaluate the model, let's assume we need to run the `evaluate.py` file, which produces metrics that are stored in the `metrics.json` file in the `metrics` folder. Then, whenever we change a hyper-parameter of the model or use a different pre-processing, we can see whether it improves the results by comparing the new metrics to these.
In this case, let's assume the JSON file contains only one object, the accuracy of the model:
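For example (the accuracy value here is only a placeholder):

```json
{
    "accuracy": 0.89
}
```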
Since the metrics JSON file is tiny, it's useful to keep it in GitHub so you can quickly check how well each experiment performed:
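A possible way to do that (commit message hypothetical):

```shell
git add metrics/metrics.json
git commit -m "Add evaluation metrics"
```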
Every time we run an experiment, we want to know precisely what inputs went into the system and what outputs were created.
First, we should push all the changes we've made to the first experiment to GitHub and DVC remote storage:
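Assuming the Git remote and the DVC remote are already configured, this amounts to:

```shell
git push
dvc push
```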
The code and model are now backed up on remote storage.
Tagging Commits to Mark a Model Version
A common practice is to use tagging to mark a specific point in your Git history as being important [5]. So, for example, if we've completed an experiment and produced a new model, we create a tag to signal to all the team members that we have a ready-to-go model:
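For instance, with a hypothetical tag name and message:

```shell
git tag -a v1.0 -m "Model ready for deployment"
```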
Also in that case, the team is free to version models with version numbers, like `v1.0`, `v1.3`, and so on.
Git tags aren't pushed with regular commits, so they must be pushed separately to our repository's origin on GitHub. Use the `--tags` switch to push all tags from our local repository to the remote:
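```shell
git push origin --tags
```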
We can always have a look at all the tags, and therefore the model versions, in the current repository:
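```shell
git tag
```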
Creating One Git Branch Per Experiment
We can create a new branch for every experiment. Let's say that in the first experiment, we set the maximum number of iterations of the model to `10`. We can try increasing that number to see if it improves the result. Create a new branch and call it `<name of the algorithm>-100-iterations`:
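For example, with `sgd` as a hypothetical algorithm name:

```shell
git checkout -b sgd-100-iterations
```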
When we create a new branch, all the `.dvc` files we had in the previous branch will be present in the new branch, just like other files and folders.
We can update the parameter in `model-config.json` so that the model has the parameter `max_iter=100`, and rerun the training and evaluation by running `train.py` and `evaluate.py`. We will have a new `model.dat` file and a new `metrics.json` file.
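The config update can be done by hand or with a small script like the one below (the key names and default values are hypothetical); afterwards, rerun `python src/train.py` and `python src/evaluate.py`:

```python
import json
import os

cfg_path = "config/model-config.json"
os.makedirs("config", exist_ok=True)

# Load the existing config if it exists; otherwise fall back to
# hypothetical defaults so the snippet is self-contained.
try:
    with open(cfg_path) as f:
        cfg = json.load(f)
except FileNotFoundError:
    cfg = {"algorithm": "sgd", "max_iter": 10}

cfg["max_iter"] = 100  # the new value for this experiment
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```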
Since the training process has changed the `model.dat` file, we need to commit it to the DVC cache:
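```shell
dvc commit model/model.dat
```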
Remember, `dvc commit` works differently from `git commit`: it updates an already tracked file. This won't delete the previous model but will create a new one.
Add and commit the changes we've made to Git:
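For example (the file list and commit message are hypothetical):

```shell
git add config/model-config.json model/model.dat.dvc metrics/metrics.json
git commit -m "Train with max_iter=100"
```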
We can also jump between branches by checking out the code from GitHub and then checking out the model from DVC.
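A sketch of switching back to another branch (the branch name is hypothetical):

```shell
git checkout master   # or any experiment branch
dvc checkout          # restore the matching data and model from the DVC cache
```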
Now we have multiple experiments and their results versioned and stored, and we can access them by checking out the content via Git and DVC.