Basic concepts

This guide explains Ploomber’s core concepts.

Ploomber allows you to quickly turn a collection of scripts, notebooks, or functions into a data pipeline by following three conventions:

  1. Each task is a function, script or notebook.

  2. Tasks declare their dependencies using an upstream variable.

  3. Tasks declare their outputs using a product variable.

A simple pipeline

Let’s say we want to build a pipeline to plot some data. Instead of coding everything in a single file, we’ll break the logic down into three steps, which makes our code more maintainable and easier to test:

(Pipeline diagram: 1-raw → 2-clean → 3-plot)

Note

A pipeline is also known as a directed acyclic graph (or DAG). We use both of these terms interchangeably.

In a Ploomber pipeline, outputs (also known as products) from one task become inputs of “downstream” tasks. Hence, “upstream” dependencies read from left to right. For example, raw is an “upstream” dependency of clean.

An “upstream” dependency implies that a given task uses its upstream products as inputs. Following the pipeline example, clean uses raw’s products, and plot uses clean’s products.

Ploomber supports three types of tasks:

  1. Python functions (also known as callables)

  2. Python scripts/notebooks (and their R equivalents)

  3. SQL scripts

You can mix any combination of tasks in your pipeline. For example, you can dump data with a SQL query, then plot it with Python.

Defining a pipeline

To execute a pipeline, Ploomber needs to know the location of the task’s source code (source key), and the location of the task’s products (product key). You can do this via a pipeline.yaml file:

tasks:
  # this is a sql script task
  - source: raw.sql
    product: [schema, name, table]
    # ...

  # this is a function task
  # "my_functions.clean" is equivalent to: from my_functions import clean
  - source: my_functions.clean
    product: output/clean.csv

  # this is a script task (notebooks work the same)
  - source: plot.py
    product:
      # scripts always generate a notebook (more on this in the next section)
      nb: output/plots.ipynb

Note

You can set a task name using name. If not present, Ploomber infers it from the source value by removing the file extension.
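
For instance (the task name below is just an illustration), you could name the plotting task explicitly like this:

tasks:
  - source: plot.py
    # explicit name; otherwise Ploomber infers "plot" from the filename
    name: make-plots
    product: output/plots.ipynb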

Once you have a pipeline.yaml file, you can run it with:

ploomber build

Ploomber keeps track of source changes to skip up-to-date tasks. If you run ploomber build again, only tasks whose source code has changed are executed. This speeds up iteration, since changes to the pipeline only trigger the tasks that actually need to run.

Tip

You can use the resources_ section in a task definition to tell Ploomber to track the content of other files. Click here to learn more.

For a full reference on pipeline.yaml files see: Spec API (pipeline.yaml).

Let’s now see how to use scripts and notebooks as pipeline tasks.

Tasks: scripts/notebooks

Jupyter notebook files (.ipynb) contain both code and output; while convenient, keeping code and outputs in the same file makes version control (i.e., git) difficult.

Our recommended approach is to use scripts as sources. However, thanks to the integration with Jupyter, you can open scripts as notebooks. The following image shows a side-by-side comparison of the same source code as .py (script) and as a .ipynb (notebook) file:

(Image: side-by-side comparison of the same code as a .py script and as an .ipynb notebook)

Note that the .py script has some # %% comments. Such markers allow us to delimit code cells and render the .py file as a notebook.
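
As a rough sketch (the code itself is just an illustration), a .py script delimited with # %% markers looks like this:

# %%
# first cell: imports
import pandas as pd

# %%
# second cell: create and inspect some data
df = pd.DataFrame({'x': [1, 2, 3]})
df.head()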

Note

The # %% marker is one way of representing .py files as notebooks. Ploomber uses jupytext to perform the conversion; other formats, such as the “light” format (# +), work too. Editors such as VS Code, Spyder, and PyCharm support the “percent” format.

To keep the benefits of the .ipynb format, Ploomber creates a copy of your scripts, converts them to .ipynb at runtime and executes them. This is a crucial concept: scripts are part of your project’s source code, but executed notebooks are pipeline products.

Note

Even though we recommend the use of .py files, you can still use regular .ipynb files as sources if you prefer.
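
For instance (the filename is hypothetical), an .ipynb source is declared like any other script task:

tasks:
  - source: explore.ipynb
    product: output/explore-executed.ipynb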

To know more about integration with Jupyter notebooks, see the Jupyter integration guide.

R scripts/notebooks are supported as well.

upstream and product

To specify task dependencies, include a special parameters cell in your script/notebook. Following the example pipeline, clean has raw as an upstream dependency, since the output of raw is an input to clean. We establish this relationship by declaring an upstream variable with a list of the task names that should execute before the file we’re editing. If a script/notebook has no dependencies, set upstream = None.

# %% tags=["parameters"]
upstream = ['raw'] # this means: execute raw.py, then clean.py
product = None

Important

product = None is a placeholder. It states that our script takes an input parameter called product, but the actual value is replaced automatically at runtime; we explain this in the next section.

Note

The # %% markers only apply to scripts. Click here for information on adding tags to .ipynb files.

The cell injection process

Note

For tips on troubleshooting pipeline loading, see Troubleshooting pipeline loading.

Let’s review the contents of a sample clean.py file:

import pandas as pd

# %% tags=["parameters"]
upstream = ['raw']
product = None

# %%
df = pd.read_csv(upstream['raw']['data'])
# some data cleaning code...

# %%
# store clean data
df.to_csv(str(product['data']), index=False)

This code will break if we run it: we declared raw as an upstream dependency, but we don’t know where to load our inputs from or where to save our outputs.

When executing your pipeline, Ploomber injects a new cell into each script/notebook, with new product and upstream variables that replace the original ones, using the information from your pipeline.yaml:

(Image: a script opened in Jupyter, showing the cell injected below the parameters cell)

As you can see in the image, the task has an upstream dependency called raw. Thus, the injected cell has a dictionary that gives you the products of raw, which we use as input, and a new product variable that we use to store our outputs.
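
For the clean.py example above, the injected cell looks roughly like this (the paths and the exact cell tag are illustrative; the actual values come from your pipeline.yaml):

# %% tags=["injected-parameters"]
upstream = {"raw": {"nb": "output/raw.ipynb", "data": "output/raw.csv"}}
product = {"nb": "output/clean.ipynb", "data": "output/clean.csv"}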

The cell injection process also happens when opening the notebook/script in Jupyter. Learn more about Cell injection and Jupyter integration.

Note

When using jupyter notebook, scripts open automatically as notebooks. If using jupyter lab, you have to right-click and select the notebook option.

Since scripts/notebooks always create an executed notebook, you must specify where to save such a file. A typical task declaration looks like this:

tasks:
  - source: plot.py
    # output notebook
    product: output/plots.ipynb

If the source script/notebook generates more than one output, create a dictionary under product:

tasks:
  - source: plot.py
    product:
      # if the script generates other products, use "nb" for the notebook
      nb: output/plots.ipynb
      # ...and any other keys for other files
      data: output/data.csv

Note

You can choose the keys in the product dictionary freely; pick names that describe the outputs (e.g., data, data_clean, model).
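
Inside the script, you read and write using those same keys (the nb entry is produced automatically when the notebook executes). A minimal sketch:

# %%
# "data" matches the key declared under product in pipeline.yaml
df.to_csv(str(product['data']), index=False)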

To inject the cells manually, run:

ploomber nb --inject

However, if the same source appears more than once, --inject will pick the first declared task and inject that task’s parameters.

Here is an example where template.ipynb appears in three different tasks:

  tasks:
    - source: template.ipynb
      name: task-a
      product: output/template-task-a.ipynb
      params:
        some_param: param-a

    - source: template.ipynb
      name: task-a-suffix
      product: output/template-task-a-suffix.ipynb
      params:
        some_param: param-a-suffix

    - source: template.ipynb
      name: task-b-suffix
      product: output/template-task-b-suffix.ipynb
      params:
        some_param: param-b-suffix

By using the inject-priority parameter in setup.cfg, we can specify which set of parameters to inject:

To inject param-a into task-a:

  [ploomber]
  entry-point = path/to/pipeline.yaml
  inject-priority = task-a

Use wildcards (*) to inject parameters into multiple tasks

To inject param-a-suffix into task-a-suffix, and param-b-suffix into task-b-suffix:

  [ploomber]
  entry-point = path/to/pipeline.yaml
  inject-priority = *-suffix

This covers scripts and notebooks as tasks. If you want to learn how to use functions as tasks, keep scrolling; otherwise, skip to the end.

It is also possible to use placeholders in the pipeline.yaml file. For example, the following file uses a placeholder some_param.

# Content of pipeline.yaml
tasks:
  - source: print.py
    name: print
    product:
      nb: 'output/{{some_param}}/notebook.html'
    papermill_params:
      log_output: True
    params:
      some_param: '{{some_param}}'

This placeholder should be defined inside the env.yaml file; more on that here.
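
A minimal env.yaml for the example above could look like this (the value is illustrative):

# env.yaml
some_param: some-value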

Tasks: functions

You can also use functions as tasks; the following section explains how.

upstream and product

The only requirement for a function to be a valid task is to have a product parameter.

import pandas as pd

def clean(product):
    # ... data cleaning code that produces a DataFrame ...
    df = pd.DataFrame({'column': [1, 2, 3]})
    # save output using the product argument
    df.to_csv(product, index=False)

Note

If the function generates multiple products, product becomes a dictionary; for example, product['one'] and product['another'].
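
For example (the keys and the cleaning logic are illustrative), a function with two products writes one file per key:

import pandas as pd

def clean(product):
    # ... data cleaning code that produces a DataFrame ...
    df = pd.DataFrame({'x': [1, 2, 3]})
    df.to_csv(product['one'], index=False)
    df.describe().to_csv(product['another'])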

If the task has upstream dependencies, add an upstream parameter:

import pandas as pd

def clean(product, upstream):
    # load the product of the upstream task, then save our own product
    df = pd.read_csv(upstream['task_name'])
    df.to_csv(product, index=False)

When resolving dependencies, Ploomber looks for references such as upstream['task_name']; then, during execution, it passes the requested inputs. For example: upstream={'task_name': 'path/to/product/from/upstream.csv'}.

This covers scripts and functions as tasks. If you want to learn how to use SQL scripts as tasks, keep scrolling; otherwise, skip to the end.

Tasks: SQL

SQL tasks require more setup because you have to configure a client to connect to the database. We explain the product and upstream mechanism here; an upcoming guide describes how to configure database clients.

upstream and product

SQL scripts require placeholders for product and upstream. A script that has no upstream dependencies looks like this:

CREATE TABLE {{product}} AS -- {{product}} is a placeholder
SELECT * FROM my_table WHERE my_column > 10

In your pipeline.yaml file, specify product as a list of three or two elements: [schema, name, table] or [name, table]. If creating a view, use view as the last element: [schema, name, view].

Say you have product: [schema, name, table] in your pipeline.yaml file. The {{product}} placeholder is replaced by schema.name:

CREATE TABLE schema.name AS
SELECT * FROM my_table WHERE my_column > 10

If the script has upstream dependencies, use the {{upstream['task_name']}} placeholder:

CREATE TABLE {{product}} AS
SELECT * FROM {{upstream['task_name']}} WHERE my_column > 10

{{upstream['task_name']}} tells Ploomber to run the task named 'task_name' first and to replace {{upstream['task_name']}} with that task’s product.
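
In the pipeline.yaml, you simply list both tasks; the dependency comes from the {{upstream[...]}} reference in the SQL source (the schema and table names below are hypothetical):

tasks:
  - source: raw.sql
    product: [my_schema, raw_data, table]

  # clean.sql contains {{upstream['raw']}}, so it runs after raw.sql
  - source: clean.sql
    product: [my_schema, clean_data, table]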

Clients

To establish a connection with a database, you have to configure a client. All databases that have a Python driver are supported, including systems like Snowflake or Apache Hive. To learn more, see the SQL guide.
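
As a rough sketch (the dotted path is hypothetical, and the exact setup depends on your database; see the SQL guide for details), SQL tasks get their connection from a clients section in pipeline.yaml:

clients:
  # dotted path to a function in your project that returns a database client
  SQLScript: clients.get_client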

Where to go from here

We’ve created many runnable templates to help you get up and running; check out our Downloading templates guide.

If you want to read about advanced features, check out User Guide.

The pipeline.yaml API offers a concise way to declare pipelines, but if you want complete flexibility, you can use the underlying Python API. Click here to learn more, or click here to see an example.