# Create Reproducible Pipelines

To train the machine learning model, the researcher needs to:

1. Fetch the data
2. Prepare the data
3. Run training
4. Evaluate the training run

We saw how to add the datasets and models to remote storage. The researcher can now get them with `dvc checkout` or `dvc pull`. The other steps were executed by running various Python files. These can be chained together into a single execution, called a DVC **pipeline**, that requires only one command.

For instance, we can create a new branch and call it `<name of the algorithm>-pipeline`:

```
$ git checkout -b <name of the algorithm>-pipeline
```

We'll use this branch to rerun the experiment as a DVC pipeline. A pipeline consists of multiple stages, each created and executed with a `dvc run` command. Each stage has three components:

1. Inputs
2. Outputs
3. Command

DVC uses the term **dependencies** for inputs and **outs** for outputs. Each of the three Python files, `prepare.py`, `train.py`, and `evaluate.py` will be represented by a stage in the pipeline.

A pipeline automatically adds newly created files to DVC control, just as if we'd typed `dvc add`.

First, we're going to run `prepare.py` as a DVC pipeline stage. The command for this is `dvc run`, which needs to know the dependencies, outputs, and command:

1. **Dependencies:** `src/prepare.py`, the data in `data/raw`, and `prepare-config.json`
2. **Outputs:** `data/prepared/train/`, `data/prepared/val/`, and `data/prepared/test/`
3. **Command:** `python src/prepare.py`

Execute `prepare.py` as a DVC pipeline stage with the `dvc run` command:

```
$ dvc run -n prepare \
        -d src/prepare.py -d data/raw -d prepare-config.json \
        -o data/prepared/train -o data/prepared/val -o data/prepared/test \
        python src/prepare.py
```

All of this is a single command. The first line starts the `dvc run` command and accepts a few options:

* The **`-n`** switch gives the stage a name.
* The **`-d`** switch passes the dependencies to the command.
* The **`-o`** switch defines the outputs of the command.

Once we create the stage, DVC creates two files: `dvc.yaml`, which describes the pipeline, and `dvc.lock`, which records the MD5 hashes of the stage's dependencies and outputs.
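
The `dvc.yaml` entry generated for this stage should look roughly like the following. This is a sketch reconstructed from the options above, not verbatim output; the exact layout can vary between DVC versions:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw
      - prepare-config.json
    outs:
      - data/prepared/train
      - data/prepared/val
      - data/prepared/test
```

Because this file is plain text, it belongs in Git, and subsequent `dvc run` calls append further stages to it.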

This way, we've automated the first stage of the pipeline. We can apply the same approach to automate the next two stages.

The next stage in the pipeline is training. The dependencies are the `train.py` file itself, the `model-config.json` file, and the `data/prepared` subfolders. The only output is the `model.dat` file. To create a pipeline stage out of `train.py`, execute it with `dvc run`, specifying the correct dependencies and outputs:

```
$ dvc run -n train \
        -d src/train.py -d data/prepared/train/ -d data/prepared/val/ -d model-config.json \
        -o model/model.dat \
        python src/train.py
```

This will create the second stage of the pipeline and record it in the `dvc.yaml` and `dvc.lock` files.

The final stage will be the evaluation. The dependencies are the `evaluate.py` file and the model file generated in the previous stage. The output is the metrics file, `metrics.json`. Execute `evaluate.py` with `dvc run`:

```
$ dvc run -n evaluate \
        -d src/evaluate.py -d model/model.dat \
        -M metrics/metrics.json \
        python src/evaluate.py
```

Notice that we used the `-M` switch instead of `-o`. This is because DVC treats metrics differently from other outputs. Running this command still generates the `metrics.json` file, but DVC knows that it's a metric used to measure the model's performance.
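
In `dvc.yaml`, the `-M` switch records the file under `metrics:` rather than `outs:`. A sketch of how the evaluate stage entry might look (the `cache: false` line reflects that `-M` leaves the metrics file out of the DVC cache so Git can track it directly; exact layout may differ by DVC version):

```yaml
evaluate:
  cmd: python src/evaluate.py
  deps:
    - src/evaluate.py
    - model/model.dat
  metrics:
    - metrics/metrics.json:
        cache: false
```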

We can get DVC to show us all the metrics it knows about with the `dvc metrics show` command:

```
$ dvc metrics show
        metrics/metrics.json:
            accuracy: 0.6996197718631179
```
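
The metrics file itself is ordinary JSON, so it can also be inspected without DVC, for example in a script that gates a CI job on accuracy. A minimal sketch, using a stand-in file and the accuracy value shown above:

```shell
# Write a stand-in metrics file like the one evaluate.py produces.
mfile=$(mktemp)
printf '{"accuracy": 0.6996197718631179}\n' > "$mfile"

# Read a single metric back out with plain JSON tooling.
accuracy=$(python3 -c "import json, sys; print(json.load(open(sys.argv[1]))['accuracy'])" "$mfile")
echo "accuracy: $accuracy"
```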

Now we can tag the new branch and push all the changes to GitHub and DVC:

```
$ git add --all
$ git commit -m "Rerun <name of the algorithm> as pipeline"
$ dvc commit
$ git push --set-upstream origin <name of the algorithm>-pipeline
$ git tag -a <name of the algorithm>-pipeline -m "Trained <name of the algorithm> as DVC pipeline."
$ git push origin --tags
$ dvc push
```

This will version and store the new DVC pipeline's code, models, and data.

If we now move to another algorithm we can start by creating and checking out a new branch and calling it `<name of the new algorithm>`:

```
$ git checkout -b "<name of the new algorithm>"
```

and change the algorithm in the `model-config.json` file.

Since the `model-config.json` file changed, its MD5 hash has changed. DVC will realize that one of the pipeline stages needs to be reproduced. We can check what changed with the `dvc status` command:

```
$ dvc status
train:
    Changed deps:
        modified:           model-config.json
```
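
The change detection above hinges on content hashing: any edit to a dependency flips its MD5 hash, which marks every stage that depends on it as stale. A standalone shell sketch of the idea (this is an illustration, not DVC's actual implementation):

```shell
# Stand-in for a pipeline dependency such as model-config.json.
cfg=$(mktemp)
printf 'algorithm: sgd\n' > "$cfg"
before=$(md5sum "$cfg" | cut -d' ' -f1)

# Edit the file, as we did when switching algorithms.
printf 'algorithm: random_forest\n' > "$cfg"
after=$(md5sum "$cfg" | cut -d' ' -f1)

# A differing hash is the signal that dependent stages must rerun.
[ "$before" != "$after" ] && echo "dependency changed: stage is stale"
```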

This will display all the changed dependencies for every stage of the pipeline. Since the model change will also affect the metrics, we want to reproduce the whole chain. We can reproduce any DVC pipeline stage with the `dvc repro` command:

```
$ dvc repro evaluate
```

When we run the `repro` command, DVC checks the entire pipeline's dependencies to determine what's changed and which commands need to be executed again. Think about what this means. We can jump from branch to branch and reproduce any experiment with a single command.

To wrap up, push your classifier code to GitHub and the model to DVC:

```
$ git add --all
$ git commit -m "Train <name of the new algorithm> classifier"
$ dvc commit
$ git push --set-upstream origin <name of the new algorithm>
$ git tag -a <name of the new algorithm> -m "<name of the new algorithm> classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
```

Now we can compare metrics across multiple branches and tags.

Call `dvc metrics show` with the `-T` switch to display metrics across multiple tags:

```
$ dvc metrics show -T
<name of the previous algorithm>:
    metrics/metrics.json:
        accuracy: 0.6996197718631179
<name of the new algorithm>:
    metrics/metrics.json:
        accuracy: 0.8098859315589354
```

This gives us a quick way to keep track of what the best-performing experiment was in the repository.

When a data scientist returns to this project in six months and doesn't remember the details, they can check which setup was the most successful with `dvc metrics show -T` and reproduce it with `dvc repro`. Anyone else who wants to reproduce the work can do the same. They'll need to take three steps:

1. Run `git clone` or `git checkout` to get the code and `.dvc` files.
2. Get the training data with `dvc checkout`.
3. Reproduce the entire workflow with `dvc repro evaluate`.

We have run multiple experiments and safely versioned and backed up the data and models. Moreover, we can quickly reproduce each experiment by getting the necessary code and data and executing a single `dvc repro` command.

