Batch processing

Ploomber pipelines can export to production-grade schedulers for batch processing. Check out our package Soopervisor, which allows you to export to Kubernetes (via Argo Workflows), AWS Batch, and Airflow.

Composing batch pipelines

To compose a batch pipeline, use the import_tasks_from directive in your pipeline.yaml file.

For example, define your feature generation tasks in a features.yaml file:

# generate one feature...
- source: features.a_feature
  product: features/a_feature.csv

# another feature...
- source: features.another_feature
  product: features/another_feature.csv

# join the two previous features...
- source: features.join
  product: features/all.csv
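
The dotted sources above point at plain Python functions in a features.py module. The sketch below shows one way those functions could look; it is illustrative, not the definitive implementation. Ploomber passes each function the product path declared in the YAML and infers dependencies from the upstream[...] keys referenced in the body. The upstream key 'get' and the column some_column are assumptions: they presume the raw-data task in each pipeline is given the name get (for example via a name: entry) and produces a table containing that column.

# features.py (sketch)
from pathlib import Path

import pandas as pd


def _read(path):
    """Load a table whether the upstream task stored CSV or parquet."""
    path = str(path)
    return pd.read_parquet(path) if path.endswith('.parquet') else pd.read_csv(path)


def _write_csv(df, product):
    """Persist a dataframe to the product path, creating the folder if needed."""
    path = Path(str(product))
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)


def a_feature(product, upstream):
    """Compute one feature from the raw data (transformation is illustrative)."""
    doubled = _read(upstream['get'])[['some_column']] * 2
    _write_csv(doubled.rename(columns={'some_column': 'a_feature'}), product)


def another_feature(product, upstream):
    """Compute a second feature from the raw data (transformation is illustrative)."""
    squared = _read(upstream['get'])[['some_column']] ** 2
    _write_csv(squared.rename(columns={'some_column': 'another_feature'}), product)


def join(product, upstream):
    """Concatenate the two previous features into a single table."""
    a = pd.read_csv(str(upstream['a_feature']))
    b = pd.read_csv(str(upstream['another_feature']))
    _write_csv(pd.concat([a, b], axis=1), product)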

Then import those tasks in your training pipeline, pipeline.yaml:

meta:
    # import feature generation tasks
    import_tasks_from: features.yaml

tasks:
    # Get raw data for training
    - source: train.get_historical_data
      product: raw/get.csv

    # The import_tasks_from directive injects your feature generation tasks here

    # Train a model
    - source: train.train_model
      product: model/model.pickle
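
A matching train.py module could look like the hedged sketch below. The synthetic data, the some_column and target columns, the scikit-learn estimator, and the assumption that the raw-data task is named get are all illustrative additions consistent with the features.py sketch above; adapt them to your real data source and model.

# train.py (sketch)
import pickle
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression


def get_historical_data(product):
    """Produce the raw historical table; synthetic data stands in for the real source."""
    df = pd.DataFrame({'some_column': range(100),
                       'target': [i % 2 for i in range(100)]})
    path = Path(str(product))
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)


def train_model(product, upstream):
    """Fit a model on the joined features and serialize it with pickle."""
    X = pd.read_csv(str(upstream['join']))
    y = pd.read_csv(str(upstream['get']))['target']
    model = LogisticRegression().fit(X, y)
    path = Path(str(product))
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(model, f)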

Your serving pipeline, pipeline-serve.yaml, would look like this:

meta:
    # import feature generation tasks
    import_tasks_from: features.yaml

tasks:
    # Get new data for predictions
    - source: serve.get_new_data
      product: serve/get.parquet

    # The import_tasks_from directive injects your feature generation tasks here

    # Make predictions using a trained model
    - source: serve.predict
      product: serve/predictions.csv
      params:
        path_to_model: model.pickle
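
A matching serve.py module might look like this sketch. As above, the synthetic input data and column name are assumptions carried over from the features.py sketch; note that path_to_model arrives as a keyword argument because it is declared under params: in the YAML.

# serve.py (sketch)
import pickle
from pathlib import Path

import pandas as pd


def get_new_data(product):
    """Fetch fresh data to score; synthetic data stands in for the real source."""
    df = pd.DataFrame({'some_column': range(10)})
    path = Path(str(product))
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)


def predict(product, upstream, path_to_model):
    """Score the joined features with the trained model and store the predictions."""
    with open(path_to_model, 'rb') as f:
        model = pickle.load(f)
    X = pd.read_csv(str(upstream['join']))
    out = pd.DataFrame({'prediction': model.predict(X)})
    path = Path(str(product))
    path.parent.mkdir(parents=True, exist_ok=True)
    out.to_csv(path, index=False)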

Example

Here’s an example project showing how to use import_tasks_from to create a training (pipeline.yaml) and serving (pipeline-serve.yaml) pipeline.