Ploomber
graph LR
    r1[Get dataset A] --> c1[Clean] --> f1[Features] --> f[Join features]
    r2[Get dataset B] --> c2[Clean] --> f2[Features] --> f
    f --> t[Train model] --> e[Evaluate model]
    class r1,c1,f1 done;
    class r2,c2,f2,f,t,e pending;
Coding an entire analysis pipeline in a single notebook file allows you to develop your code interactively,
but it creates an unmaintainable monolith that breaks easily. Ploomber allows you to modularize your
analysis into smaller tasks without losing the power of an interactive notebook.
Ploomber is the simplest way to turn your notebooks, (Python/R/SQL) scripts or Python functions into a
reproducible data pipeline.
Simple
- (Optional) List your pipeline scripts in a pipeline.yaml file
- Inside each notebook (or script), state dependencies via an upstream variable
- Use a product variable to declare output file(s) that the next notebook (or script) will use as inputs
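
To illustrate how per-task upstream declarations are enough to define a whole pipeline, here is a minimal sketch in plain Python. It does not use Ploomber and is not its implementation: the `upstream` mapping and the task names (`get-a`, `clean-a`, and so on) are hypothetical stand-ins for what each script would declare, and the ordering relies on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: task name -> the tasks it lists as upstream.
# In a real project each script would declare only its own dependencies.
upstream = {
    "get-a": [],
    "clean-a": ["get-a"],
    "features-a": ["clean-a"],
    "get-b": [],
    "clean-b": ["get-b"],
    "features-b": ["clean-b"],
    "join": ["features-a", "features-b"],
    "train": ["join"],
    "evaluate": ["train"],
}


def execution_order(upstream):
    """Return one valid order in which every task runs after its upstream tasks."""
    return list(TopologicalSorter(upstream).static_order())


order = execution_order(upstream)
print(order)
```

Because every task names only its direct dependencies, adding a new step means editing a single declaration; the overall order is derived rather than maintained by hand.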
Powerful
- Incremental builds (skip up-to-date tasks)
- Pipeline testing
- Pipeline inspection and debugging
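
The idea behind incremental builds can be sketched as follows. This is a simplified illustration, not Ploomber's actual mechanism: it assumes a task is up to date when its product file exists and a hash of the task's source code matches the one recorded at the last build. The helper names (`source_hash`, `is_up_to_date`, `build`) and the `.meta.json` sidecar file are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def source_hash(source_code):
    """Fingerprint the task's source code."""
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()


def meta_path(product):
    """Hypothetical sidecar file that records how the product was built."""
    return Path(str(product) + ".meta.json")


def is_up_to_date(product, source_code):
    """True if the product exists and was built from the same source code."""
    if not Path(product).exists() or not meta_path(product).exists():
        return False
    recorded = json.loads(meta_path(product).read_text())
    return recorded.get("source_hash") == source_hash(source_code)


def build(product, source_code, run):
    """Run the task only when needed; return True if it ran, False if skipped."""
    if is_up_to_date(product, source_code):
        return False
    run(product)
    meta_path(product).write_text(
        json.dumps({"source_hash": source_hash(source_code)})
    )
    return True


def write_csv(path):
    """Stand-in for the real work a task would do."""
    Path(path).write_text("a,b\n1,2\n")


with tempfile.TemporaryDirectory() as tmp:
    product = Path(tmp) / "clean.csv"
    first = build(product, "df.dropna()", write_csv)    # no product yet: runs
    second = build(product, "df.dropna()", write_csv)   # unchanged: skipped
    third = build(product, "df.fillna(0)", write_csv)   # source changed: runs

print(first, second, third)
```

A fuller version would also rerun a task when any of its upstream products has been rebuilt; the sketch leaves that out to keep the core check visible.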
Integrates with Jupyter
- Automatically inject a new cell with the location of your input files, as inferred from your upstream variable
- Python and R scripts are converted to a notebook on the fly
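
Cell injection can be pictured with a small sketch that works directly on the notebook's JSON structure. It is an illustration under stated assumptions rather than Ploomber's code: it assumes the declarations live in a cell tagged `parameters`, places the new cell right after it, and tags the new cell `injected-parameters`. The function `inject_inputs`, the sample notebook, and the path `output/clean.csv` are all hypothetical.

```python
import copy


def inject_inputs(notebook, upstream_paths):
    """Return a copy of `notebook` with a new code cell, placed right after
    the cell tagged "parameters", that assigns the resolved input paths."""
    nb = copy.deepcopy(notebook)
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": f"upstream = {upstream_paths!r}",
        "outputs": [],
        "execution_count": None,
    }
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            nb["cells"].insert(i + 1, new_cell)
            return nb
    raise ValueError('no cell tagged "parameters" found')


# A tiny notebook as a plain dict; the second cell is never executed here.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": "upstream = ['clean']",
            "outputs": [],
            "execution_count": None,
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": "data_path = upstream['clean']",
            "outputs": [],
            "execution_count": None,
        },
    ],
}

injected = inject_inputs(notebook, {"clean": "output/clean.csv"})
print(injected["cells"][1]["source"])
```

The author's cell only names the upstream task; the injected cell that follows rebinds `upstream` to concrete file locations, so the rest of the notebook can read its inputs without hard-coded paths.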