Note: You can run this from your computer (Jupyter or terminal), or use one of the hosted options: Binder or Deepnote.

Your first Python pipeline

This guide shows you how to run your first Ploomber pipeline.

Note: This guide is intended as a quick, interactive experience. If you want to learn about Ploomber’s core concepts and design rationale, go to the next tutorial.

Setup (skip if using deepnote or binder)

Get code (run in a terminal):

git clone https://github.com/ploomber/projects
cd projects/spec-api-python

Install dependencies:

# if using conda
conda env create --file environment.yaml
conda activate spec-api-python

# otherwise use pip directly
pip install -r requirements.txt

Description

This pipeline contains three tasks: raw.py gets some data, clean.py cleans it, and plot.py generates a visualization:

[1]:
%%sh
ls *.py
clean.py
plot.py
raw.py

These three scripts make up our pipeline (or DAG), which is a collection of tasks with a pre-defined execution order.

Note: These tasks are Python scripts, but you can use functions, notebooks, and even SQL scripts. The next guide explains how other types of tasks work.

Ploomber integrates with Jupyter. If you open the scripts in the Jupyter Notebook app, they render as notebooks. If you’re using JupyterLab, right-click the file and select Open With -> Notebook.

Along with the *.py files, there is a pipeline.yaml file where we declare which files we use as tasks:

[2]:
%%sh
cat pipeline.yaml
tasks:
  - source: raw.py
    product:
      nb: output/raw.ipynb
      data: output/data.csv

  - source: clean.py
    product:
      nb: output/clean.ipynb
      data: output/clean.csv

  - source: plot.py
    product: output/plot.ipynb

Note: The pipeline.yaml file is optional, but it gives you more flexibility. Click here to see an example without a pipeline.yaml file.
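
To make the structure of the spec concrete, here is a minimal Python sketch that mirrors the pipeline.yaml above as a plain dictionary and lists the product paths of each task. The spec literal and the list_products helper are illustrative assumptions, not part of Ploomber's API. Task names are taken from the source file names, which matches the names that appear in the command output in this guide.

```python
from pathlib import Path

# Hand-written mirror of the pipeline.yaml shown above (assumption: kept in
# sync manually; Ploomber itself reads the YAML file directly).
spec = {
    "tasks": [
        {"source": "raw.py",
         "product": {"nb": "output/raw.ipynb", "data": "output/data.csv"}},
        {"source": "clean.py",
         "product": {"nb": "output/clean.ipynb", "data": "output/clean.csv"}},
        {"source": "plot.py", "product": "output/plot.ipynb"},
    ]
}


def list_products(spec):
    """Map each task name (source file stem) to a list of its product paths."""
    products = {}
    for task in spec["tasks"]:
        name = Path(task["source"]).stem
        product = task["product"]
        # A product is either a single path or a dictionary of named paths.
        paths = list(product.values()) if isinstance(product, dict) else [product]
        products[name] = paths
    return products


for name, paths in list_products(spec).items():
    print(name, "->", ", ".join(paths))
```

Note that plot.py declares a single product (the output notebook), while the other two tasks declare a dictionary with a notebook and a data file.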

Let’s plot the pipeline:

[3]:
%%sh
# Note: plotting doesn't work in deepnote
ploomber plot
Plot saved at: pipeline.png
100%|██████████| 3/3 [00:00<00:00, 13766.86it/s]
[4]:

from IPython.display import Image
Image(filename='pipeline.png')
[4]:
../_images/get-started_spec-api-python_8_0.png

The status command gives us an overview of the pipeline:

[5]:
%%sh
ploomber status
name    Last run      Outdated?    Product       Doc (short)    Location
------  ------------  -----------  ------------  -------------  ------------
raw     Has not been  Source code  MetaProduct(                 /home/docs/c
        run                        {'data': Fil                 heckouts/rea
                                   e('output/da                 dthedocs.org
                                   ta.csv'),                    /user_builds
                                   'nb': File('                 /ploomber/ch
                                   output/raw.i                 eckouts/proj
                                   pynb')})                     ects-
                                                                master/spec-
                                                                api-python/r
                                                                aw.py
clean   Has not been  Source code  MetaProduct(                 /home/docs/c
        run           & Upstream   {'data': Fil                 heckouts/rea
                                   e('output/cl                 dthedocs.org
                                   ean.csv'),                   /user_builds
                                   'nb': File('                 /ploomber/ch
                                   output/clean                 eckouts/proj
                                   .ipynb')})                   ects-
                                                                master/spec-
                                                                api-python/c
                                                                lean.py
plot    Has not been  Source code  File('output                 /home/docs/c
        run           & Upstream   /plot.ipynb'                 heckouts/rea
                                   )                            dthedocs.org
                                                                /user_builds
                                                                /ploomber/ch
                                                                eckouts/proj
                                                                ects-
                                                                master/spec-
                                                                api-python/p
                                                                lot.py
100%|██████████| 3/3 [00:00<00:00, 13736.80it/s]

How is execution order determined?

Ploomber infers the pipeline structure from your code. If task B uses the output from task A as input, we say A is an upstream dependency of B. For example, to clean the data, we must get it first; hence, we declare the following in clean.py:

# execute the 'raw' task before 'clean'
upstream = ['raw']
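
To illustrate how upstream declarations translate into an execution order, the following sketch sorts the three tasks with Python's standard graphlib module (Python 3.9+). This is not Ploomber's implementation, only a generic topological sort; the upstream mapping is written by hand, and the plot entry assumes plot.py declares clean as its upstream, as this guide describes.

```python
from graphlib import TopologicalSorter

# Upstream declarations per task. 'clean' -> 'raw' comes from the snippet
# above; 'plot' -> 'clean' is the dependency this guide describes for plot.py.
upstream = {
    "raw": [],
    "clean": ["raw"],
    "plot": ["clean"],
}

# TopologicalSorter takes a mapping of node -> predecessors and yields the
# nodes so that every task appears after all of its upstream dependencies.
execution_order = list(TopologicalSorter(upstream).static_order())
print(execution_order)  # ['raw', 'clean', 'plot']
```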

Once we finish cleaning the data, we must save it somewhere (this is known as a product). Products can be files or SQL relations. Our current example only generates files.

To specify where to save the output of each task, we use the product key. For example, the raw task definition looks like this:

- source: raw.py
  product:
    nb: output/raw.ipynb
    data: output/data.csv
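
As a rough sketch of how a task could use these declarations, the example below imitates a cleaning step with only the standard library. It assumes the script receives upstream and product as dictionaries keyed the way pipeline.yaml declares them. Here they are assigned by hand as stand-ins, and the sample data, temporary paths, and cleaning rule are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def clean(upstream, product):
    """Copy rows from the upstream data file, skipping rows with empty values."""
    with open(upstream["raw"]["data"], newline="") as src:
        rows = list(csv.DictReader(src))
    kept = [row for row in rows if all(value != "" for value in row.values())]

    out_path = Path(product["data"])
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


# Stand-ins for the values the pipeline would provide (hypothetical paths).
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "output" / "data.csv"
raw_path.parent.mkdir(parents=True)
raw_path.write_text("x,y\n1,2\n3,\n5,6\n")

upstream = {"raw": {"data": str(raw_path)}}
product = {"data": str(workdir / "output" / "clean.csv")}

n_rows = clean(upstream, product)
print(f"kept {n_rows} rows")
```

The point of the design is that the task never hard-codes its input or output location: it reads wherever the upstream product lives and writes wherever the product key says.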

Scripts and notebooks automatically generate a copy of themselves in Jupyter notebook format (.ipynb). That’s why we see a notebook in the product dictionary (nb key). The notebook format lets us generate standalone files with charts and tables, so there is no need to write extra code to save our charts.

Treating notebooks as pipeline products is a crucial concept: raw.py is part of the pipeline’s source code, but output/raw.ipynb is not. It is an artifact generated by the source code.

If you don’t want to generate output notebooks, you can use Python functions as tasks. Our upcoming tutorial goes deeper into the different types of tasks available.

Building the pipeline

Let’s build the pipeline:

[6]:
%%sh
mkdir -p output
ploomber build
name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
raw     True         37.186        90.2406
clean   True          1.50995       3.66426
plot    True          2.51166       6.09512
Building task 'raw':   0%|          | 0/3 [00:00<?, ?it/s]
Executing:   0%|          | 0/5 [00:00<?, ?cell/s]
Executing:  20%|██        | 1/5 [00:36<02:25, 36.41s/cell]
Executing: 100%|██████████| 5/5 [00:37<00:00,  7.43s/cell]
Building task 'clean':  33%|███▎      | 1/3 [00:37<01:14, 37.19s/it]
Executing:   0%|          | 0/5 [00:00<?, ?cell/s]
Executing: 100%|██████████| 5/5 [00:01<00:00,  3.33cell/s]
Building task 'plot':  67%|██████▋   | 2/3 [00:38<00:16, 16.20s/it]
Executing:   0%|          | 0/7 [00:00<?, ?cell/s]
Executing:  14%|█▍        | 1/7 [00:01<00:10,  1.72s/cell]
Executing:  71%|███████▏  | 5/7 [00:01<00:00,  3.26cell/s]
Executing: 100%|██████████| 7/7 [00:02<00:00,  2.81cell/s]
Building task 'plot': 100%|██████████| 3/3 [00:41<00:00, 13.74s/it]

This pipeline saves all the output in the output/ directory; we have a few data files:

[7]:
%%sh
ls output/*.csv
output/clean.csv
output/data.csv

And a notebook for each script:

[8]:
%%sh
ls output/*.ipynb
output/clean.ipynb
output/plot.ipynb
output/raw.ipynb

Updating the pipeline

Quick experimentation is essential when developing a data pipeline. Ploomber allows you to quickly run new experiments without having to keep track of task dependencies.

Let’s say you found a problematic column in the data and want to add more cleaning logic to your clean.py script. raw.py does not depend on clean.py, but plot.py does. If you modify clean.py, you’d have to execute clean.py and then plot.py to bring your pipeline up-to-date.

As your pipeline grows, keeping track of task dependencies gets time-consuming. Ploomber does that for you and only executes outdated tasks on each run.
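
The idea of executing only outdated tasks can be sketched as follows. This is a simplified illustration, not Ploomber's change-detection logic: it assumes we already know which task's source code changed, and it marks that task plus everything downstream of it. The outdated_tasks helper and the hand-written upstream mapping are hypothetical.

```python
def outdated_tasks(upstream, changed):
    """Return tasks to re-run: changed ones plus everything downstream of them."""
    outdated = set(changed)
    # Keep sweeping until no new task is marked; fine for small pipelines.
    added = True
    while added:
        added = False
        for task, deps in upstream.items():
            if task not in outdated and outdated.intersection(deps):
                outdated.add(task)
                added = True
    # Preserve the declaration order of the mapping for readability.
    return [task for task in upstream if task in outdated]


upstream = {"raw": [], "clean": ["raw"], "plot": ["clean"]}
print(outdated_tasks(upstream, changed={"clean"}))  # ['clean', 'plot']
```

With this mapping, changing clean.py marks clean and plot for execution, while raw is skipped.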

Make some changes to the clean.py script, then build again:

[9]:
%%sh
ploomber build
name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
raw     False               0             0
clean   False               0             0
plot    False               0             0
0it [00:00, ?it/s]

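If you changed clean.py, you’ll see that raw.py didn’t run because it was not affected by the change. The output above shows a run where nothing had changed, so every task was skipped. Try modifying any of the other tasks, then come back and run ploomber build.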

Where to go from here

This tutorial showed how to build a pipeline with Ploomber; however, it only superficially covered Ploomber’s core concepts and design rationale. The upcoming tutorial goes deeper into those topics.