# How to integrate Vizro with the Kedro Data Catalog
This page describes how to integrate Vizro with Kedro, an open-source Python framework for creating reproducible, maintainable, and modular data science code. Vizro provides a convenient way to visualize pandas datasets registered in a Kedro Data Catalog.
Even if you do not have a Kedro project, you can still use a Kedro Data Catalog to manage your dashboard's data sources. This separates configuration of your data from your app's code and is particularly useful for dashboards with many data sources or more complex data loading configuration.
## Installation
If you already have Kedro installed, you do not need to install any extra dependencies. If you do not have Kedro installed, run:
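```bash
pip install kedro
```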
Vizro is currently compatible with `kedro>=0.19.0` and works with dataset factories for `kedro>=0.19.9`.
## Create a Kedro Data Catalog
You can create a Kedro Data Catalog to be a YAML registry of your dashboard's data sources. To do so, create a new file called `catalog.yaml` in the same directory as your `app.py`. Below is an example `catalog.yaml` file that illustrates some of the key features of the Kedro Data Catalog.
```yaml
cars: # (1)!
  type: pandas.CSVDataset # (2)!
  filepath: cars.csv

motorbikes:
  type: pandas.CSVDataset
  filepath: s3://your_bucket/data/motorbikes.csv # (3)!
  load_args: # (4)!
    sep: ','
    na_values: [NA]
  credentials: s3_credentials # (5)!

trains:
  type: pandas.ExcelDataset
  filepath: trains.xlsx
  load_args:
    sheet_name: [Sheet1, Sheet2, Sheet3]

trucks:
  type: pandas.ParquetDataset
  filepath: trucks.parquet
  load_args:
    columns: [name, gear, disp, wt]
    categories: list
    index: name
```
1. The minimum details needed for a Kedro Data Catalog entry are the data source name (`cars`), the type of data (`type`), and the file's location (`filepath`).
2. Vizro supports all `kedro_datasets.pandas` datasets. This includes, for example, CSV, Excel and Parquet files.
3. Kedro supports a variety of data stores including local file systems, network file systems and cloud object stores.
4. You can pass data loading arguments to specify how to load the data source.
5. You can securely inject credentials into data loading functions using a `credentials.yaml` file or environment variables, as sketched below.
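For example, the `s3_credentials` entry referenced above could be defined in a `credentials.yaml` file along the following lines. This is an illustrative sketch only: the exact keys depend on your storage backend (here, the `key` and `secret` arguments accepted by `s3fs`).

```yaml
# credentials.yaml (sketch; key names depend on your storage backend)
s3_credentials:
  key: YOUR_AWS_ACCESS_KEY_ID
  secret: YOUR_AWS_SECRET_ACCESS_KEY
```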
As shown below, the best way to use the `catalog.yaml` is with the Kedro configuration loader `OmegaConfigLoader`. For simple cases, this functions much like `yaml.safe_load`. However, the Kedro configuration loader also enables more advanced functionality.
**Kedro configuration loader features**
Here are a few features of the Kedro configuration loader that are not possible with `yaml.safe_load` alone. For more details, refer to Kedro's documentation on advanced configuration.
- Configuration environments to organize settings that might differ between your development and production environments. For example, you might have different S3 buckets for development and production data.
- Recursive scanning for configuration files to merge complex configuration that is split across multiple files and folders.
- Templating (variable interpolation) and dynamically computed values (resolvers); see the sketch after this list.
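As a minimal sketch of templating in `catalog.yaml`: with `OmegaConfigLoader`, keys prefixed with an underscore act as reusable variables rather than dataset entries. The `_base_location` variable here is illustrative:

```yaml
_base_location: s3://your_bucket/data  # illustrative template variable

motorbikes:
  type: pandas.CSVDataset
  filepath: ${_base_location}/motorbikes.csv  # interpolated when the catalog is loaded
```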
## Use datasets from the Kedro Data Catalog
Vizro provides functions to help generate and process a Kedro Data Catalog in the module `vizro.integrations.kedro`. These functions support both the original `DataCatalog` and the more recently introduced `KedroDataCatalog`. Given a Kedro `catalog`, the general pattern to add datasets to the Vizro data manager is:
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```
The code above registers all data sources of type `kedro_datasets.pandas` in the Kedro `catalog` with Vizro's `data_manager`. You can now reference the data source by name. For example, given the above `catalog.yaml` file, you could use the data source names `"cars"`, `"motorbikes"`, `"trains"`, and `"trucks"` with `px.scatter("cars", ...)`.
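As a minimal sketch of referencing a registered data source in a dashboard, the following passes the `"cars"` data source name to a Vizro figure. The column names `x="mpg"` and `y="wt"` are hypothetical and depend on the contents of your `cars.csv`:

```python
import vizro.models as vm
import vizro.plotly.express as px

# "cars" refers to the data source registered in the data manager above;
# the column names are hypothetical.
page = vm.Page(
    title="Cars",
    components=[vm.Graph(figure=px.scatter("cars", x="mpg", y="wt"))],
)
```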
**Note**: Data sources imported from Kedro in this way are dynamic data. This means that the data can be refreshed while your dashboard is running. For example, if you run a Kedro pipeline, the latest data is shown in the Vizro dashboard without restarting it.
The `catalog` variable may have been created in a number of different ways:

- Data Catalog configuration file (`catalog.yaml`), created as described above. This generates a `catalog` variable independently of a Kedro project using `DataCatalog.from_config`.
- Kedro project path. Vizro exposes a helper function `catalog_from_project` to generate a `catalog` given the path to a Kedro project.
- Kedro Jupyter session. This automatically exposes `catalog`.
The full code for these different cases is given below.
**Import a Kedro Data Catalog into the Vizro data manager**
```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog  # (1)!

from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

conf_loader = OmegaConfigLoader(conf_source=".")  # (2)!
catalog = DataCatalog.from_config(conf_loader["catalog"])  # (3)!

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```
1. Kedro's experimental `KedroDataCatalog` would also work.
2. This loads and parses configuration in `catalog.yaml`. The argument `conf_source="."` specifies that `catalog.yaml` is found in the same directory as `app.py` or a subdirectory beneath this level. In a more complex setup, this could include configuration environments, for example to organize configuration for development and production data sources (see the sketch below).
3. If you have credentials, these can be injected with `DataCatalog.from_config(conf_loader["catalog"], conf_loader["credentials"])`.
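As a sketch of the configuration environments mentioned above, assuming you organize configuration into `conf/base` and `conf/prod` directories following Kedro's conventions:

```python
# A sketch, assuming conf/base and conf/prod directories exist.
conf_loader = OmegaConfigLoader(conf_source="conf", base_env="base", env="prod")
catalog = DataCatalog.from_config(conf_loader["catalog"])
```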
Alternatively, given the path to a Kedro project, you can use the helper function `catalog_from_project`:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```
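In a Kedro Jupyter session, `catalog` is already defined, so the general pattern shown earlier works as-is.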
## Use dataset factories
To add datasets that are defined using a Kedro dataset factory, `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:
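```python
kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"])  # (1)!
```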
1. You can specify the name of your pipeline, for example `pipelines["my_pipeline"]`, or even combine multiple pipelines with `pipelines["a"] + pipelines["b"]`. The Kedro `__default__` pipeline is what runs by default with the `kedro run` command.
The `pipelines` variable may have been created in the following ways:

- Kedro project path. Vizro exposes a helper function `pipelines_from_project` to generate a `pipelines` dictionary given the path to a Kedro project.
- Kedro Jupyter session. This automatically exposes `pipelines`.
The full code for these different cases is given below.
**Import a Kedro Data Catalog with dataset factories into the Vizro data manager**
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
pipelines = kedro_integration.pipelines_from_project(project_path)

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
    catalog, pipeline=pipelines["__default__"]
).items():
    data_manager[dataset_name] = dataset_loader
```