FAQ and Glossary

Why do products have clients?

Clients exist in tasks and products because they serve different purposes. A task client manages the connection to the database that runs your script. On the other hand, the product’s client only handles the storage of the product’s metadata.

To enable incremental runs, Ploomber has to store the source code that generated any given product. Storing metadata in the same database that runs your code requires a system-specific implementation. Currently, only SQLite and PostgreSQL are supported, via ploomber.products.SQLiteRelation and ploomber.products.PostgresRelation respectively. In these two cases, the task client and the product client communicate with the same system (the database), so they can be initialized with the same client.

For any other database, we provide two alternatives; in both cases, the task’s client is different from the product’s client. The first alternative is ploomber.products.GenericSQLRelation, which represents a generic table or view and saves metadata in a SQLite database; in this case, the task’s client is the database client (e.g., Oracle, Hive, Snowflake), but the product’s client is a SQLite client. The second alternative is ploomber.products.SQLRelation, a product with no metadata, which you can use if you don’t need incremental builds.
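For example, here is a minimal sketch of the first alternative (the connection strings and file names are hypothetical): a SQLScript task runs against the analytical database, while its GenericSQLRelation product stores metadata in a local SQLite file:

from pathlib import Path

from ploomber import DAG
from ploomber.tasks import SQLScript
from ploomber.products import GenericSQLRelation
from ploomber.clients import SQLAlchemyClient

dag = DAG()

# task client: executes the script in the analytical database
dag.clients[SQLScript] = SQLAlchemyClient('snowflake://user:password@account/database')

# product client: stores the product's metadata in a local SQLite file
dag.clients[GenericSQLRelation] = SQLAlchemyClient('sqlite:///metadata.db')

SQLScript(Path('create_table.sql'),
          GenericSQLRelation(('schema', 'my_table', 'table')),
          dag=dag,
          name='create_table')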

Which databases are supported?

The answer depends on the task you use. There are two types of database clients: ploomber.clients.SQLAlchemyClient for SQLAlchemy-compatible databases, and ploomber.clients.DBAPIClient for everything else (the only requirement for DBAPIClient is a driver that implements PEP 249).
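For reference, here is a minimal sketch showing how each client type is initialized (the connection details are placeholders):

import sqlite3

from ploomber.clients import SQLAlchemyClient, DBAPIClient

# SQLAlchemy-compatible database: pass a connection URI
sqlalchemy_client = SQLAlchemyClient('postgresql://user:password@host/database')

# any driver implementing PEP 249: pass the connect function and its arguments
dbapi_client = DBAPIClient(sqlite3.connect, dict(database='my.db'))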

ploomber.tasks.SQLDump supports both types of clients.

ploomber.tasks.SQLScript supports both types of clients. But if you want incremental builds, you must also configure a product client. See the section below for details.

ploomber.tasks.SQLUpload relies on pandas.to_sql to upload a local file to a database. Since that method requires SQLAlchemy, SQLUpload only supports SQLAlchemyClient.

ploomber.tasks.PostgresCopyFrom is a faster alternative to SQLUpload when using PostgreSQL. It relies on pandas.to_sql only to create the table; the actual data upload is done with psycopg, which calls the native COPY FROM procedure.

What are incremental builds?

When developing pipelines, we usually make small changes and want to see what the final output looks like (e.g., after adding a feature to a model training pipeline). Incremental builds allow us to skip redundant work by only executing tasks whose source code has changed since the last execution. To do so, Ploomber has to save the product’s metadata: for ploomber.products.File, it creates another file in the same location; for SQL products such as ploomber.products.GenericSQLRelation, a metadata backend is required, which is configured via the client parameter.
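To see the effect, build the same pipeline twice without modifying it (a minimal sketch, assuming a pipeline.yaml in the current directory):

from ploomber.spec import DAGSpec

dag = DAGSpec('pipeline.yaml').to_dag()

dag.build()  # first run: executes every task
dag.build()  # second run: skips tasks whose source code did not change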

How do I specify a task with a variable number of outputs?

You must group the outputs into a single product and declare it as a directory; a short sketch follows the links below.

  • Click here to see an example.

  • If you’re using serializers, click here to see an example.
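As a minimal sketch of the idea, assuming a Python function task whose product is declared as a directory in pipeline.yaml (the function and file names are hypothetical):

from pathlib import Path

def many_outputs(product):
    # 'product' is the directory declared in pipeline.yaml
    out = Path(str(product))
    out.mkdir(parents=True, exist_ok=True)

    # the number of output files is only known at runtime
    for i in range(3):
        (out / f'part-{i}.csv').write_text('x,y\n1,2\n')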

Should tasks generate products?

Yes. Tasks must generate at least one product; this is typically a file but can be a table or view in a database.

If you find yourself trying to write a task that generates no outputs, consider the following options:

  1. Merge the code that does not generate outputs with upstream tasks that generate outputs.

  2. Use the on_finish hook to execute code after a task executes successfully (click here to learn more).
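For reference, here is a minimal sketch of an on_finish hook (the function and file names are hypothetical):

from pathlib import Path

# e.g., in hooks.py, referenced from pipeline.yaml as on_finish: hooks.check_output
def check_output(product):
    # runs only after the task finishes successfully
    assert Path(str(product)).exists(), 'expected the task to generate its product'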

Auto reloading code in Jupyter

When you import a module in Python (e.g., from module import my_function), the interpreter caches the code, so subsequent changes to my_function won’t take effect (even if you run the import statement again) until you restart the kernel. This is inconvenient if you are iterating on code stored in an external file.

To overcome this limitation, insert the following at the top of your notebook, before any import statements:

# auto reload modules
%load_ext autoreload
%autoreload 2

Once executed, changes to imported modules will take effect without restarting the kernel. Note that this feature has some limitations.

Glossary

  1. Dotted path. A dot-separated string pointing to a Python module/class/function, e.g., “my_module.my_function”.

  2. Entry point. A location that tells Ploomber how to initialize a DAG; it can be a spec file, a directory, or a dotted path.

  3. Hook. A function executed after a certain event happens; e.g., the task “on finish” hook executes after the task finishes successfully.

  4. Spec. A dictionary-like specification to initialize a DAG, usually provided via a YAML file.