Create Reproducible Pipelines

To train the machine learning model, the researcher needs to:

  1. Fetch the data

  2. Prepare the data

  3. Run training

  4. Evaluate the training run

We saw how to add the datasets and models to remote storage. The researcher can now get them with dvc checkout or dvc pull. The remaining steps were executed by running various Python files. These can be chained together into a single execution, called a DVC pipeline, that requires only one command.

For instance, we can create a new branch and call it <name of the algorithm>-pipeline:

$ git checkout -b <name of the algorithm>-pipeline

We'll use this branch to rerun the experiment as a DVC pipeline. A pipeline consists of multiple stages, and each stage is created and executed with a dvc run command. Each stage has three components:

  1. Inputs

  2. Outputs

  3. Command

DVC uses the term dependencies for inputs and outs for outputs. Each of the three Python files, prepare.py, train.py, and evaluate.py, will be represented by a stage in the pipeline.

A pipeline automatically adds newly created files to DVC control, just as if we'd typed dvc add.

First, we're going to run prepare.py as a DVC pipeline stage. The command for this is dvc run, which needs to know the dependencies, outputs, and command:

  1. Dependencies: prepare.py, the data in data/raw, and the prepare-config.json file

  2. Outputs: data/prepared/train/, data/prepared/val/, and data/prepared/test/

  3. Command: python prepare.py

Execute prepare.py as a DVC pipeline stage with the dvc run command:

$ dvc run -n prepare \
        -d src/prepare.py -d data/raw -d prepare-config.json \
        -o data/prepared/train -o data/prepared/val -o data/prepared/test \
        python src/prepare.py

All of this is a single command, with backslashes continuing it across rows. The first row starts the dvc run command, which accepts a few options:

  • The -n switch gives the stage a name.

  • The -d switch passes the dependencies to the command.

  • The -o switch defines the outputs of the command.

Once we create the stage, DVC will create two files, dvc.yaml and dvc.lock.
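The dvc.yaml file is a human-readable description of the pipeline. Based on the dvc run call above, the generated file should look roughly like this (exact ordering may vary by DVC version):

stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - prepare-config.json
    - src/prepare.py
    outs:
    - data/prepared/test
    - data/prepared/train
    - data/prepared/val

The dvc.lock file records the same stage along with the MD5 hash of every dependency and output, which is how DVC later detects what has changed.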

This way we've automated the first stage of the pipeline. We can apply the same approach to automate the next two stages.

The next stage in the pipeline is training. The dependencies are the train.py file itself, the model-config.json file, and the data/prepared/train/ and data/prepared/val/ subfolders. The only output is the model.dat file. To create a pipeline stage out of train.py, execute it with dvc run, specifying the correct dependencies and outputs:

$ dvc run -n train \
        -d src/train.py -d data/prepared/train/ -d data/prepared/val/ -d model-config.json \
        -o model/model.dat \
        python src/train.py

This will create the second stage of the pipeline and record it in the dvc.yaml and dvc.lock files.
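Assuming the same layout as before, dvc.yaml now gains a second stage entry alongside prepare, something like:

  train:
    cmd: python src/train.py
    deps:
    - data/prepared/train
    - data/prepared/val
    - model-config.json
    - src/train.py
    outs:
    - model/model.dat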

The final stage will be the evaluation. The dependencies are the evaluate.py file and the model file generated in the previous stage. The output is the metrics file, metrics.json. Execute evaluate.py with dvc run:

$ dvc run -n evaluate \
        -d src/evaluate.py -d model/model.dat \
        -M metrics/metrics.json \
        python src/evaluate.py

Notice that we used the -M switch instead of -o. This is because DVC treats metrics differently from other outputs. When we run this command, it will generate the metrics.json file, and DVC will know that it's a metric used to measure the model's performance. Unlike regular outputs, a file added with -M isn't cached by DVC, so it can be committed to Git directly.
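The metrics file itself is plain JSON written by evaluate.py; judging from the output shown below, in this project it holds something like a single accuracy value:

{
    "accuracy": 0.6996197718631179
}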

We can get DVC to show us all the metrics it knows about with the dvc metrics show command:

$ dvc metrics show
        metrics/metrics.json:
            accuracy: 0.6996197718631179

Now we can tag the new branch and push all the changes to GitHub and DVC:

$ git add --all
$ git commit -m "Rerun <name of the algorithm> as pipeline"
$ dvc commit
$ git push --set-upstream origin <name of the algorithm>-pipeline
$ git tag -a <name of the algorithm>-pipeline -m "Trained <name of the algorithm> as DVC pipeline."
$ git push origin --tags
$ dvc push

This will version and store the new DVC pipeline's code, models, and data.

If we now move on to another algorithm, we can start by creating and checking out a new branch called <name of the new algorithm>:

$ git checkout -b "<name of the new algorithm>"

and change the algorithm in the model-config.json file.
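What that edit looks like depends entirely on how train.py reads the config. Assuming a hypothetical algorithm key, it could be as small as:

{
    "algorithm": "<name of the new algorithm>"
}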

Since model-config.json changed, its MD5 hash has changed. DVC will realize that one of the pipeline stages needs to be reproduced. We can check what changed with the dvc status command:

$ dvc status
train:
    Changed deps:
        modified:           model-config.json

This will display all the changed dependencies for every stage of the pipeline. Since the change to the model will also affect the metrics, we want to reproduce the whole chain. We can reproduce any stage of a DVC pipeline with the dvc repro command:

$ dvc repro evaluate

When we run the repro command, DVC checks the entire pipeline's dependencies to determine what's changed and which commands need to be executed again. Think about what this means. We can jump from branch to branch and reproduce any experiment with a single command.
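For example, returning to the earlier experiment is just a checkout followed by a repro:

$ git checkout <name of the algorithm>-pipeline
$ dvc checkout
$ dvc repro evaluate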

To wrap up, push your classifier code to GitHub and the model to DVC:

$ git add --all
$ git commit -m "Train <name of the new algorithm> classifier"
$ dvc commit
$ git push --set-upstream origin <name of the new algorithm>
$ git tag -a <name of the new algorithm> -m "<name of the new algorithm> classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push

Now we can compare metrics across multiple branches and tags.

Call dvc metrics show with the -T switch to display metrics across multiple tags:

$ dvc metrics show -T
<name of the previous algorithm>:
    metrics/metrics.json:
        accuracy: 0.6996197718631179
<name of the new algorithm>:
    metrics/metrics.json:
        accuracy: 0.8098859315589354

This gives us a quick way to keep track of what the best-performing experiment was in the repository.

When a data scientist returns to this project in six months and doesn't remember the details, they can check which setup was the most successful with dvc metrics show -T and reproduce it with dvc repro. Anyone else who wants to reproduce that work can do the same. They'll need to take three steps (consolidated in the sketch after this list):

  1. Run git clone or git checkout to get the code and .dvc files.

  2. Get the training data with dvc checkout.

  3. Reproduce the entire workflow with dvc repro evaluate.
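Put together, and assuming the Git and DVC remotes are configured as before, the whole procedure fits on a few lines:

$ git clone <repository URL>
$ cd <repository>
$ git checkout <name of the new algorithm>
$ dvc pull
$ dvc repro evaluate

Here dvc pull both fetches the data from remote storage and checks it out; dvc checkout alone is enough when the local DVC cache is already populated.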

We can run multiple experiments with the data and models safely versioned and backed up. Moreover, we can quickly reproduce each experiment by getting the necessary code and data and executing a single dvc repro command.
