FAQ and Glossary¶
Why do products have clients?¶
Clients exist in tasks and products because they serve different purposes. A task client manages the connection to the database that runs your script. On the other hand, the product’s client only handles the storage of the product’s metadata.
To enable incremental runs, Ploomber has to store the source code that generated
any given product. Storing metadata in the same database that runs your code
requires a system-specific implementation. Currently, only SQLite and PostgreSQL
are supported, via ploomber.products.SQLiteRelation and
ploomber.products.PostgresRelation respectively. In these two cases, the
task client and the product client communicate with the same system (the database),
so they can be initialized with the same client.
For any other database, we provide two alternatives; in both cases, the
task’s client is different from the product’s client. The first alternative is
ploomber.products.GenericSQLRelation, which represents a generic
table or view and saves metadata in a SQLite database; in this case, the
task’s client is the database client (e.g., Oracle, Hive, Snowflake) but
the product’s client is a SQLite client. If you don’t need the incremental
builds feature, you can use ploomber.products.GenericProduct,
which is a product with no metadata.
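To make the task-client/product-client split concrete, here is a minimal sketch (illustrative only, not Ploomber’s actual implementation; the class and method names are made up) of what a product client has to do: store, per product, the source code that generated it. Note there is no query execution here; running your SQL is the task client’s job.

```python
import sqlite3


class SQLiteMetadataBackend:
    """Illustrative product-client sketch: persists, for each product,
    the source code that generated it in a SQLite database."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata "
            "(product TEXT PRIMARY KEY, source TEXT)"
        )

    def save(self, product, source):
        # overwrite any previous record for this product
        self.conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?)",
            (product, source),
        )
        self.conn.commit()

    def fetch(self, product):
        # return the stored source, or None if the product never ran
        row = self.conn.execute(
            "SELECT source FROM metadata WHERE product = ?", (product,)
        ).fetchone()
        return row[0] if row else None


backend = SQLiteMetadataBackend()
backend.save("schema.my_table", "CREATE TABLE my_table AS SELECT 1")
print(backend.fetch("schema.my_table"))
```

Because the metadata lives in its own SQLite file, the task’s SQL can run on any database (Oracle, Hive, Snowflake) while the product client stays the same.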
Which databases are supported?¶
The answer depends on the task you use. There are two types of database clients:
ploomber.clients.SQLAlchemyClient for SQLAlchemy-compatible databases, and
ploomber.clients.DBAPIClient for the rest (the only requirement for
DBAPIClient is a driver that implements PEP 249).
ploomber.tasks.SQLDump supports both types of clients.
ploomber.tasks.SQLScript supports both types of clients. But if you
want incremental builds, you must also configure a product client. See the section
below for details.
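The distinction is about the driver interface, not the database itself. For example, Python’s built-in sqlite3 module is itself a PEP 249 driver, so the kind of object a DBAPIClient-style wrapper works with looks like this (a plain driver sketch, not Ploomber code):

```python
import sqlite3

# Any PEP 249 driver exposes connect() -> connection -> cursor(),
# which is the essential interface DBAPIClient relies on.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE numbers (n INTEGER)")
cur.executemany("INSERT INTO numbers VALUES (?)", [(1,), (2,), (3,)])
conn.commit()
cur.execute("SELECT SUM(n) FROM numbers")
total = cur.fetchone()[0]
print(total)  # 6
```

SQLAlchemy-compatible databases get the richer SQLAlchemyClient; everything else can fall back to the bare driver interface shown above.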
ploomber.tasks.SQLUpload relies on pandas.to_sql to upload a local
file to a database. That method relies on SQLAlchemy to work; hence, it only
works with ploomber.clients.SQLAlchemyClient.
ploomber.tasks.PostgresCopyFrom is a faster alternative to
SQLUpload when using PostgreSQL. It relies on pandas.to_sql only
to create the table, but the actual data upload is done with
psycopg, which calls the native
COPY FROM procedure.
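As a sketch of the underlying mechanism (assuming psycopg2 and a reachable PostgreSQL instance; the DSN, file name, and table name are placeholders), the fast path boils down to streaming a local file through COPY instead of issuing row-by-row INSERTs:

```python
import psycopg2

# Placeholders: adjust the connection string, file, and table name.
conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur, open("data.csv") as f:
    # copy_expert streams the file through PostgreSQL's native
    # COPY FROM, which is much faster than per-row inserts
    cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)
```

This requires a running PostgreSQL server, so it is shown here only to illustrate why the COPY-based path outperforms pandas.to_sql for bulk uploads.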
What are incremental builds?¶
When developing pipelines, we usually make small changes and want to see what
the final output looks like (e.g., adding a feature to a model training pipeline).
Incremental builds allow us to skip redundant work by only executing tasks
whose source code has changed since the last execution. To do so, Ploomber
has to save the Product’s metadata. For
ploomber.products.File, it creates
another file in the same location; for SQL products such as
ploomber.products.SQLRelation, a metadata backend is required, which
is configured via the product’s client.
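The core of the skip decision can be sketched in a few lines (illustrative only; Ploomber’s real metadata tracks more than a source hash, and the names below are made up):

```python
import hashlib

# product name -> digest of the source code that built it
metadata = {}


def maybe_run(name, source, run):
    """Run the task only if its source changed since the last build."""
    digest = hashlib.sha256(source.encode()).hexdigest()
    if metadata.get(name) == digest:
        return "skipped"  # source unchanged: redundant work avoided
    run()
    metadata[name] = digest  # record what produced this product
    return "ran"


sql = "CREATE TABLE t AS SELECT 1"
first = maybe_run("t", sql, lambda: None)
second = maybe_run("t", sql, lambda: None)
print(first, second)  # ran skipped
```

Editing the source string would change the digest, so the next build would execute the task again instead of skipping it.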
Glossary¶
Dotted path. A dot-separated string pointing to a Python module/class/function, e.g. “my_module.my_function”.
Entry point. A location to tell Ploomber how to initialize a DAG, can be a spec file, a directory, or a dotted path
Hook. A function executed after a certain event happens, e.g., the task “on finish” hook executes after the task executes successfully
Spec. A dictionary-like specification to initialize a DAG, usually provided via a YAML file
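For illustration, a dotted path like the one described above can be resolved with the standard library alone (a sketch, not Ploomber’s actual loader):

```python
import importlib


def load_dotted_path(dotted):
    """Resolve 'package.module.attribute' to the attribute itself."""
    module_name, _, attr = dotted.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


sqrt = load_dotted_path("math.sqrt")
print(sqrt(9.0))  # 3.0
```

This is why a dotted path can serve as an entry point: a short string is enough to locate and import the function that builds your DAG.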