Configuration (dev/prod)

In the previous guide (Parametrized pipelines), we saw how to use an env.yaml file to parametrize our pipeline and switch parameters from the command line.

Sometimes we want to change all the parameters at once. The most common scenario is switching configuration between development and production.

For example, say you’re working on a Machine Learning pipeline whose pipeline.yaml looks like this:

tasks:

  - source: get.py
    product:
      nb: get.ipynb
      data: raw.csv
    params:
      sample_pct: '{{sample_pct}}'

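  # The next two tasks are illustrative placeholders: each task needs its own
  # source and products, so the names below (clean.py, train.py) are assumed.
  - source: clean.py
    product:
      nb: clean.ipynb
      data: clean.csv

  - source: train.py
    product:
      nb: train.ipynb
      model: model.pickle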

The pipeline above has one placeholder, '{{sample_pct}}', which controls the percentage of raw data to download. You may want to develop locally with a fraction of the data, say 20%, to iterate quickly. For a smoke test, you may run it with an even smaller sample, say 1%. Finally, to train a model, you’ll use 100% of the data.
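As a rough illustration of what a percentage parameter like this can control, here is a minimal sketch of a sampling helper that a task such as get.py might call. The function name sample_rows and its seed argument are hypothetical and not part of Ploomber; the sketch also samples rows that are already in memory rather than limiting a download.

```python
import random


def sample_rows(rows, sample_pct, seed=0):
    """Return about sample_pct percent of rows (hypothetical helper)."""
    if not 0 < sample_pct <= 100:
        raise ValueError("sample_pct must be in the range (0, 100]")
    if not rows:
        return []
    # Keep at least one row so very small percentages still yield data
    n_rows = max(1, round(len(rows) * sample_pct / 100))
    # A dedicated, seeded generator keeps the sample reproducible across runs
    return random.Random(seed).sample(rows, n_rows)
```

With 1,000 input rows, the three configurations described above (20, 1, and 100) would keep 200, 10, and 1,000 rows respectively.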

By default, Ploomber looks for an env.yaml. To enable rapid local development with 20% of the data, you may create an env.yaml file like this:

sample_pct: 20

For smoke testing, env.test.yaml:

sample_pct: 1

And for training, env.train.yaml:

sample_pct: 100

To switch configurations, you can set the PLOOMBER_ENV_FILENAME environment variable to env.test.yaml in the testing environment and to env.train.yaml in the training environment.

Whenever PLOOMBER_ENV_FILENAME has a value, Ploomber looks for a file with that name instead of env.yaml. Note that the value must be a filename, not a path, since Ploomber expects env files to exist in the same folder as the pipeline.yaml file.
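One way to switch configurations from a script (for example, in a CI job) is to set the variable only for the process that runs the build. The sketch below assumes the ploomber command is installed and on the PATH; the helper names env_with_config and build are hypothetical, and the filename check simply mirrors the "filename, not a path" rule described above.

```python
import os
import subprocess


def env_with_config(env_filename, base_env=None):
    """Return a copy of the environment with PLOOMBER_ENV_FILENAME set."""
    # The variable must hold a bare filename, so reject empty values and
    # anything that carries a directory component
    if not env_filename or os.path.basename(env_filename) != env_filename:
        raise ValueError(f"expected a filename, not a path: {env_filename!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["PLOOMBER_ENV_FILENAME"] = env_filename
    return env


def build(env_filename, project_dir="."):
    """Run `ploomber build` in project_dir with the chosen env file."""
    return subprocess.run(
        ["ploomber", "build"],
        cwd=project_dir,  # folder that contains pipeline.yaml and the env file
        env=env_with_config(env_filename),
        check=True,  # raise CalledProcessError if the build fails
    )
```

Calling build("env.test.yaml") would run the smoke-test configuration, and build("env.train.yaml") the training one, without changing the environment of the calling process.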

Note: You can use placeholders (e.g., {{sample_pct}}) anywhere in the pipeline.yaml file. Another typical use case is to switch the product location (e.g., product: '{{product_directory}}/some-data.csv').

Managing multiple pipelines

If your project has more than one pipeline, they’ll likely need different env.yaml files.

Say you have two pipelines, one for training a model (pipeline.yaml) and one for serving it (pipeline.serve.yaml). You can create an env.yaml file to parametrize pipeline.yaml and an env.serve.yaml to parametrize pipeline.serve.yaml:

project/
    pipeline.yaml
    pipeline.serve.yaml
    env.yaml
    env.serve.yaml

The general rule is as follows: when loading a pipeline.{name}.yaml, Ploomber extracts the {name} portion and looks for an env.{name}.yaml file; if no such file exists, it looks for an env.yaml file. Note that the PLOOMBER_ENV_FILENAME environment variable overrides this process.
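The lookup order can be sketched in a few lines of Python. This is an illustration of the rule as stated here, not Ploomber's actual implementation; the function name resolve_env_filename and its arguments are hypothetical.

```python
import os
from pathlib import Path


def resolve_env_filename(pipeline_filename, directory=".", environ=None):
    """Return the env filename chosen by the rule above, or None if no file exists."""
    environ = os.environ if environ is None else environ

    # 1. The environment variable, when set, overrides everything else
    override = environ.get("PLOOMBER_ENV_FILENAME")
    if override:
        return override

    directory = Path(directory)
    parts = Path(pipeline_filename).name.split(".")

    # 2. pipeline.{name}.yaml -> try env.{name}.yaml first
    if len(parts) == 3 and parts[0] == "pipeline":
        candidate = f"env.{parts[1]}.yaml"
        if (directory / candidate).is_file():
            return candidate

    # 3. Fall back to the default env.yaml
    if (directory / "env.yaml").is_file():
        return "env.yaml"

    return None
```

For the project layout shown earlier, this sketch would pick env.serve.yaml for pipeline.serve.yaml and env.yaml for pipeline.yaml.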

Alternatively, you may separate the pipelines into different directories and put an env.yaml in each one:

project-a/
    pipeline.yaml
    env.yaml
project-b/
    pipeline.yaml
    env.yaml