Create Reproducible Pipelines
To train the machine learning model, the researcher needs to:
Fetch the data
Prepare the data
Run training
Evaluate the training run
We saw how to add the datasets and models to remote storage. The researcher can now get them with `dvc checkout` or `dvc pull`. The other steps were executed by running various Python files. These can be chained together into a single execution, called a DVC pipeline, that requires only one command.
For instance, we can create a new branch and call it <name of the algorithm>-pipeline:
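A minimal sketch, using a hypothetical algorithm name of sgd:

```bash
# "sgd" is a hypothetical algorithm name; substitute your own
git checkout -b sgd-pipeline
```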
We'll use this branch to rerun the experiment as a DVC pipeline. A pipeline consists of multiple stages and is executed with the `dvc run` command. Each stage has three components:
Inputs
Outputs
Command
DVC uses the term dependencies for inputs and outs for outputs. Each of the three Python files, `prepare.py`, `train.py`, and `evaluate.py`, will be represented by a stage in the pipeline.
A pipeline automatically adds newly created files to DVC control, just as if we'd typed `dvc add`.
First, we're going to run `prepare.py` as a DVC pipeline stage. The command for this is `dvc run`, which needs to know the dependencies, outputs, and command:
Dependencies: `prepare.py`, the data in `dataset_id/raw`, and `prepare-config.json`
Outputs: `dataset_id/prepared/train/` and the other subfolders under `dataset_id/prepared/`
Command: `python prepare.py`
Execute `prepare.py` as a DVC pipeline stage with the `dvc run` command:
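The full command isn't reproduced here, but a sketch of it, assuming the stage is named prepare and using the paths listed above, could look like this:

```bash
# Sketch of the prepare stage; the stage name is assumed,
# and the outputs are simplified to the parent folder
dvc run -n prepare \
        -d prepare.py -d prepare-config.json -d dataset_id/raw \
        -o dataset_id/prepared \
        python prepare.py
```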
All of this is a single command. The first line starts the `dvc run` command, which accepts a few options:
The `-n` switch gives the stage a name.
The `-d` switch passes the dependencies to the command.
The `-o` switch defines the outputs of the command.
Once we create the stage, DVC will create two files, `dvc.yaml` and `dvc.lock`.
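The generated `dvc.yaml` records the stage definition. Its exact contents depend on the command used, but for the sketch above it would look roughly like this:

```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
      - prepare-config.json
      - dataset_id/raw
    outs:
      - dataset_id/prepared
```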
This way we've automated the first stage of the pipeline. We can apply the same approach to automate the next two stages.
The next stage in the pipeline is training. The dependencies are the `train.py` file itself, `model-config.json`, and the `dataset_id/prepared/` subfolders. The only output is the `model.dat` file. To create a pipeline stage out of `train.py`, execute it with `dvc run`, specifying the correct dependencies and outputs:
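The exact command isn't shown here; a sketch, assuming the stage is named train and the script is run as `python train.py`, might be:

```bash
# Sketch of the train stage; the stage name is assumed
dvc run -n train \
        -d train.py -d model-config.json -d dataset_id/prepared \
        -o model.dat \
        python train.py
```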
This will create the second stage of the pipeline and record it in the `dvc.yaml` and `dvc.lock` files.
The final stage will be the evaluation. The dependencies are the `evaluate.py` file and the model file generated in the previous stage. The output is the metrics file, `metrics.json`. Execute `evaluate.py` with `dvc run`:
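A sketch of this stage, assuming it is named evaluate and the script is run as `python evaluate.py`:

```bash
# Sketch of the evaluate stage; -M marks metrics.json as a metric file
dvc run -n evaluate \
        -d evaluate.py -d model.dat \
        -M metrics.json \
        python evaluate.py
```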
Notice that we used the `-M` switch instead of `-o`. This is because DVC treats metrics differently from other outputs. When we run this command, it will generate the `metrics.json` file, but DVC will know that it's a metric used to measure the model's performance.
We can get DVC to show us all the metrics it knows about with the `dvc metrics show` command:
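In its simplest form the command takes no arguments:

```bash
# Show all metrics DVC is tracking in the workspace
dvc metrics show
```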
Now we can tag the new branch and push all the changes to GitHub and DVC:
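A sketch of the tagging and pushing steps, with hypothetical commit message and tag name:

```bash
git add .
git commit -m "Create the DVC pipeline"        # hypothetical message
git tag -a sgd-pipeline-v1 -m "First pipeline" # hypothetical tag name
git push origin HEAD --tags
dvc push
```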
This will version and store the new DVC pipeline's code, models, and data.
If we now move to another algorithm, we can start by creating and checking out a new branch, calling it <name of the new algorithm>, and then change the algorithm in `model-config.json`.
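A sketch of these two steps, assuming a hypothetical random forest experiment:

```bash
# "random-forest" is a hypothetical name for the new algorithm
git checkout -b random-forest
# Next, edit model-config.json so that it selects the new algorithm
```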
Since the `model-config.json` file changed, its MD5 hash has changed. DVC will realize that one of the pipeline stages needs to be reproduced. We can check what changed with the `dvc status` command:
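Running it from the repository root is enough:

```bash
# List stages whose dependencies or outputs have changed
dvc status
```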
This will display all the changed dependencies for every stage of the pipeline. Since the model change will also affect the metric, we want to reproduce the whole chain. We can reproduce any DVC pipeline file with the `dvc repro` command:
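Assuming the last stage is named evaluate, as in the sketches above, reproducing the whole chain looks like this:

```bash
# Re-runs evaluate and any upstream stage whose dependencies changed
dvc repro evaluate
```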
When we run the `repro` command, DVC checks the entire pipeline's dependencies to determine what's changed and which commands need to be executed again. Think about what this means: we can jump from branch to branch and reproduce any experiment with a single command.
To wrap up, push your classifier code to GitHub and the model to DVC:
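A sketch of the wrap-up, again with a hypothetical commit message and tag name:

```bash
git add .
git commit -m "Train with the new algorithm"    # hypothetical message
git tag -a random-forest-v1 -m "New algorithm"  # hypothetical tag name
git push origin HEAD --tags
dvc push
```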
Now we can compare metrics across multiple branches and tags.
Call `dvc metrics show` with the `-T` switch to display metrics across multiple tags:
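The `-T` switch is short for `--all-tags`:

```bash
# Show metrics for every tagged commit in the repository
dvc metrics show -T
```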
This gives us a quick way to keep track of the best-performing experiment in the repository.
When a data scientist returns to this project in six months and doesn't remember the details, they can check which setup was the most successful with `dvc metrics show -T` and reproduce it with `dvc repro`. Anyone else who wants to reproduce the work can do the same. They'll need to take three steps:
1. Run `git clone` or `git checkout` to get the code and `.dvc` files.
2. Get the training data with `dvc checkout`.
3. Reproduce the entire workflow with `dvc repro evaluate`.
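Put together, and with a hypothetical repository URL and branch name, the whole sequence might look like this:

```bash
git clone https://github.com/<user>/<repo>.git   # hypothetical repository
cd <repo>
git checkout <name of the algorithm>-pipeline
dvc checkout       # restore the data (or dvc pull to fetch it from remote storage)
dvc repro evaluate
```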
We can run multiple experiments and safely version and back up the data and models. Moreover, we can quickly reproduce each experiment by getting the necessary code and data and executing a single `dvc repro` command.