To run this example locally, execute: ploomber examples -n parametrized.

Parametrized pipelines

Often, pipelines perform the same operation over different subsets of the data. For example, say you are developing visualizations of economic data. You might want to generate the same charts for other countries.

One way to approach the problem is to add a for loop to each pipeline task so that it processes every country. But that approach adds unnecessary complexity to the code; it’s better to keep the logic simple (each task processes a single country) and take the iterative logic out of the pipeline.

Ploomber allows you to do so using parametrized pipelines. Let’s see an example using a pipeline.yaml file.

Spec API (pipeline.yaml)

# Content of pipeline.yaml
tasks:
  - source: print.py
    name: print
    product:
      nb: 'output/{{some_param}}/notebook.ipynb'
    papermill_params:
      log_output: True
    params:
      some_param: '{{some_param}}'

The pipeline.yaml above has a placeholder called some_param. Its value comes from a file called env.yaml:

# Content of env.yaml
some_param: default_value

When reading your pipeline.yaml, Ploomber looks for an env.yaml file. If found, all defined keys will be available to your pipeline definition. You can use these placeholders (placeholders are strings between double curly brackets) in any of the fields of your pipeline.yaml file.

In our case, we use the placeholder in two places. First, we save the executed notebook in a folder named after the value of some_param; this keeps a copy of the generated output in a different folder for each parameter value. Second, to use the parameter in our code, we have to pass it to the task; every task accepts an optional params key that holds arbitrary parameters.
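
To make the expansion step concrete, here is a minimal sketch of how placeholders could be replaced with values from env.yaml. It is an illustration only, not Ploomber's implementation: the expand function and the PLACEHOLDER pattern are hypothetical names, and the env and task dictionaries stand in for the parsed YAML files shown above.

```python
import re

# Hypothetical stand-in for the parsed content of env.yaml
env = {"some_param": "default_value"}

# Matches placeholders such as {{some_param}}, allowing optional inner spaces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def expand(value, env):
    """Recursively replace {{key}} placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda match: str(env[match.group(1)]), value)
    if isinstance(value, dict):
        return {key: expand(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand(item, env) for item in value]
    return value  # booleans, numbers, and None pass through unchanged


# The task from pipeline.yaml, written as a plain Python dictionary
task = {
    "source": "print.py",
    "name": "print",
    "product": {"nb": "output/{{some_param}}/notebook.ipynb"},
    "papermill_params": {"log_output": True},
    "params": {"some_param": "{{some_param}}"},
}

expanded = expand(task, env)
print(expanded["product"]["nb"])
print(expanded["params"]["some_param"])
```

In this sketch, both the product path and the task parameter pick up the same value, which is why a single entry in env.yaml controls where the notebook is saved and what the code receives.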

Let’s see what the code looks like:

# Content of print.py
# + tags=["parameters"]
upstream = None
product = None
some_param = None

# +
print('some_param: ', some_param, ' type: ', type(some_param))

Our task is a Python script, meaning that parameters are passed as an injected cell at runtime. Let’s see what happens if we build our pipeline.

[1]:
%%capture captured
%%sh
ploomber build --force --log INFO
[2]:
def filter_output(captured, startswith):
    # Print only the captured stderr lines that begin with the given prefix
    lines = captured.stderr.split('\n')
    print('\n'.join(line for line in lines if line.startswith(startswith)))

filter_output(captured, startswith='INFO:papermill:some_param')
INFO:papermill:some_param:  default_value  type:  <class 'str'>

We see that some_param takes the default value (default_value) defined in env.yaml. The command-line interface is aware of the parameters in env.yaml; you can list them with the --help option:

[3]:
%%sh
ploomber build --help
usage: ploomber [-h] [--log LOG] [--entry-point ENTRY_POINT] [--force]
                [--skip-upstream] [--partially PARTIALLY] [--debug]
                [--env--some_param ENV__SOME_PARAM]

Build pipeline

optional arguments:
  -h, --help            show this help message and exit
  --log LOG, -l LOG     Enables logging to stdout at the specified level
  --entry-point ENTRY_POINT, -e ENTRY_POINT
                        Entry point, defaults to pipeline.yaml
  --force, -f           Force execution by ignoring status
  --skip-upstream, -su  Skip building upstream dependencies. Only applicable
                        when using --partially
  --partially PARTIALLY, -p PARTIALLY
                        Build a pipeline partially until certain task
  --debug, -d           Drop a debugger session if an exception happens
  --env--some_param ENV__SOME_PARAM
                        Default: default_value

Apart from the default arguments of the ploomber build command, Ploomber automatically adds one argument for each parameter in env.yaml, so we can easily override the default value. Let’s do that:

[4]:
%%capture captured
%%sh
ploomber build --force --env--some_param another_value --log INFO
[5]:
filter_output(captured, startswith='INFO:papermill:some_param')
INFO:papermill:some_param:  another_value  type:  <class 'str'>

We see that the task received the new value!
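
The override mechanism can be sketched with argparse from the standard library. This is an assumption-laden illustration rather than Ploomber's actual code: build_parser and resolve_env are hypothetical helpers, and the env dictionary stands in for the parsed env.yaml. Each key becomes an --env--KEY option whose default is the value from the file.

```python
import argparse


def build_parser(env):
    """Create a parser with one --env--KEY option per key in env (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="ploomber")
    for key, default in env.items():
        parser.add_argument(
            f"--env--{key}", default=default, help=f"Default: {default}"
        )
    return parser


def resolve_env(env, argv):
    """Return env with any values overridden by command-line arguments."""
    args = build_parser(env).parse_args(argv)
    # argparse stores --env--some_param under the attribute env__some_param
    return {key: getattr(args, f"env__{key}") for key in env}


# Stand-in for the parsed content of env.yaml
env = {"some_param": "default_value"}

print(resolve_env(env, []))
print(resolve_env(env, ["--env--some_param", "another_value"]))
```

With no arguments the default from env.yaml is kept; passing --env--some_param replaces it for that run only, leaving the file untouched.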

Finally, let’s see what the output/ folder looks like:

[6]:
%%sh
tree output
output
├── another_value
│   └── notebook.ipynb
└── default_value
    └── notebook.ipynb

2 directories, 2 files

We have separate folders for each parameter, helping to keep things organized and taking the looping logic out of our pipeline.

Notes

  • There are some built-in placeholders that you can use without having an env.yaml file. For example, {{here}} will expand to the pipeline.yaml parent directory. Check out the Spec API documentation for more information.

  • This example uses a Python script as a task. In a SQL pipeline, you can achieve the same effect by using the placeholder in the product’s schema or as a table/view name prefix.

  • If the parameter takes many different values and you want to run your pipeline with all of them, calling the CLI by hand gets tedious. You have two options: 1) write a bash script that calls the CLI with each value, or 2) use the Python API (everything the CLI can do, you can do directly from Python); take a look at the DAGSpec documentation.

  • Parametrized pipeline.yaml files are a great way to simplify a task’s logic, but don’t overdo it. If you find yourself adding too many parameters, it’s better to use the Python API directly; factory functions are the correct pattern for highly customized pipeline construction.

  • Given that the two pipelines are entirely independent, we could even run them in parallel.
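
As an alternative to a bash script, the same loop can be written in Python by calling the CLI through subprocess. This is a minimal sketch: build_command and run_all are hypothetical helpers, the list of values is made up for illustration, and the only CLI options used are the ones shown earlier in this example. The dry_run flag defaults to True so the snippet prints the commands without requiring Ploomber to be installed.

```python
import subprocess


def build_command(value):
    """Build the CLI call for a single value of some_param."""
    return ["ploomber", "build", "--force", "--env--some_param", str(value)]


def run_all(values, dry_run=True):
    """Run (or, with dry_run=True, only print) one build per value."""
    commands = [build_command(value) for value in values]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop as soon as one build fails
            subprocess.run(command, check=True)
    return commands


# Hypothetical parameter values; replace them with your own
commands = run_all(["default_value", "another_value"])
```

Calling run_all(values, dry_run=False) executes the builds one after another. Because each build writes to its own output folder, the loop itself is the only place that knows about multiple values.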

Python API (factory functions)

Parametrization is straightforward when using a factory function. If your factory takes parameters, they’ll also be available in the command-line interface. Types are inferred from type hints. Let’s see an example:

# Content of factory.py
from ploomber import DAG


def make(param: str, another: int = 10):
    dag = DAG()
    # add tasks to your pipeline...
    return dag

Our function takes two parameters: param and another. Parameters without default values (param) turn into positional arguments, and parameters with default values (another) turn into optional arguments. To see the auto-generated interface, use the --help option:

[7]:
%%sh
ploomber build --entry-point factory.make --help
usage: ploomber [-h] [--log LOG] [--entry-point ENTRY_POINT] [--force]
                [--skip-upstream] [--partially PARTIALLY] [--debug]
                [--another ANOTHER]
                param

Build pipeline

positional arguments:
  param

optional arguments:
  -h, --help            show this help message and exit
  --log LOG, -l LOG     Enables logging to stdout at the specified level
  --entry-point ENTRY_POINT, -e ENTRY_POINT
                        Entry point, defaults to pipeline.yaml
  --force, -f           Force execution by ignoring status
  --skip-upstream, -su  Skip building upstream dependencies. Only applicable
                        when using --partially
  --partially PARTIALLY, -p PARTIALLY
                        Build a pipeline partially until certain task
  --debug, -d           Drop a debugger session if an exception happens
  --another ANOTHER
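
The mapping from a function signature to command-line arguments can be sketched with inspect and argparse from the standard library. This illustrates the idea and is not Ploomber's implementation: parser_from_signature is a hypothetical helper, and make here is a stand-in that echoes its arguments instead of building a DAG.

```python
import argparse
import inspect


def make(param: str, another: int = 10):
    """Stand-in for the factory in factory.py; echoes its arguments."""
    return {"param": param, "another": another}


def parser_from_signature(function):
    """Build an argparse parser from a function's parameters (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="ploomber")
    for name, parameter in inspect.signature(function).parameters.items():
        # Use the type hint as the converter; fall back to str when there is none
        hint = parameter.annotation
        converter = str if hint is inspect.Parameter.empty else hint
        if parameter.default is inspect.Parameter.empty:
            # No default value: positional argument
            parser.add_argument(name, type=converter)
        else:
            # Default value: optional argument
            parser.add_argument(f"--{name}", type=converter, default=parameter.default)
    return parser


parser = parser_from_signature(make)
args = parser.parse_args(["some_value", "--another", "42"])
result = make(**vars(args))
print(result)
```

Here the string "42" from the command line is converted to an integer because of the int type hint on another, while param stays a string.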

Note that the Python API requires more work than a pipeline.yaml file, but it is more flexible. [Click here] to see examples using the Python API.