Kedro is an open-source Python library that helps you build production-ready data and analytics pipelines.
Questions tagged [kedro]
202 questions
2
votes
1 answer
Load existing data catalog programmatically
I want to write pytest unit tests in Kedro 0.17.5. They need to perform integrity checks on dataframes created by the pipeline.
These dataframes are specified in catalog.yml and have already been persisted successfully using kedro run. The catalog.yml is…

movingabout
2
votes
3 answers
Python Kedro PySpark : py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext
It's my first project using Kedro with PySpark and I have an issue. I work on the new Mac (M1). When I run spark-shell in the terminal, Spark appears to be installed correctly and I get the expected output (welcome to Spark version 3.2.1 with the picture).…

Mathilde Roblot
2
votes
2 answers
Is there a package in R that mimics Kedro as a modular collaborative framework for development?
I currently work with Kedro (from QuantumBlack https://kedro.readthedocs.io/en/stable/01_introduction/01_introduction.html) as a deployment-oriented framework for coding collaboratively. It is a great framework to develop machine…

Felipe Alvarenga
2
votes
1 answer
Waiting for nodes to finish in Kedro
I have a pipeline in Kedro that looks like this:
from kedro.pipeline import Pipeline, node
from .nodes import *

def foo():
    return Pipeline([
        node(a, inputs=["train_x", "test_x"], outputs=dict(bar_a="bar_a"), name="A"),
        node(b,…

João Areias
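Kedro derives execution order from each node's declared inputs and outputs, so a node starts only after the nodes producing its inputs have finished. The sketch below illustrates that idea in plain Python with the standard-library graphlib module; it is not Kedro's runner. Node "A" mirrors the excerpt above, while node "B", its input, and the dataset "bar_b" are hypothetical, since the original definition is cut off.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Each node lists the datasets it reads and writes.
# "A" follows the excerpt; "B" and "bar_b" are hypothetical.
nodes = {
    "A": {"inputs": ["train_x", "test_x"], "outputs": ["bar_a"]},
    "B": {"inputs": ["bar_a"], "outputs": ["bar_b"]},
}

# Map every produced dataset to the node that produces it.
producers = {
    out: name for name, spec in nodes.items() for out in spec["outputs"]
}

# A node depends on whichever nodes produce its inputs.
# Inputs with no producer (train_x, test_x) are treated as external data.
graph = {
    name: {producers[i] for i in spec["inputs"] if i in producers}
    for name, spec in nodes.items()
}

# static_order() yields nodes so that dependencies come first.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['A', 'B']
```

Under this assumption, making "B" wait for "A" amounts to listing one of A's outputs among B's inputs.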
2
votes
1 answer
Kedro install fails, but succeeds after a few attempts
I have to test whether my Kedro project works from GitHub, so I create a new environment, then:
git clone
pip install kedro kedro[pandas] kedro-viz jupyter
kedro build-reqs
kedro install
and the install fails, then I retry a few time…

Charles Roy
2
votes
2 answers
Kedro Data Modelling
We are struggling to model our data correctly for use in Kedro. We are using the recommended Raw\Int\Prm\Ft\Mst model but are struggling with some of the concepts, e.g.
When is a dataset a feature rather than a primary dataset? The distinction…

SinisterPenguin
2
votes
2 answers
Kedro context and catalog missing from Jupyter Notebook
I am able to run my pipelines using the kedro run command without issue. For some reason though I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my…

Pierre Delecto
2
votes
2 answers
How do I add a directory of .wav files to the Kedro data catalogue?
This is my first time trying to use the Kedro package.
I have a list of .wav files in an s3 bucket, and I'm keen to know how I can have them available within the Kedro data catalog.
Any thoughts?

Myccha
2
votes
1 answer
Why does my Kedro logging file stay empty? Am I missing a step?
I am using Kedro but I can't get my logging file to be used. I am following the tutorial. The log file was created but is still empty.
Steps done:
Configured logging
class ProjectContext(KedroContext):
    def _setup_logging(self) -> None:
        …

Antunes
2
votes
1 answer
PartitionedDataSet not found when Kedro pipeline is run in Docker
I have multiple text files in an S3 bucket which I read and process. So, I defined a PartitionedDataSet in the Kedro data catalog which looks like this:
raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    …

mendo
2
votes
1 answer
How to catalog datasets & models by S3 URI, but keep a local copy?
I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath:…

crypdick
2
votes
1 answer
Does kedro support tfrecord?
To train tensorflow keras models on AI Platform using Docker containers, we convert our raw images stored on GCS to a tfrecord dataset using tf.data.Dataset. Thereby the data is never stored locally. Instead the raw images are transformed directly…

evolved
2
votes
1 answer
Dynamic instance of pipeline execution based on dataset partition/iterator logic
Not sure if this is possible, but this is what I am trying to do:
I want to extract portions (steps) of a function as individual nodes (ok so far), but the catch is I have an iterator on top of the steps, which depends on some logic on…

Mohit
2
votes
1 answer
Does Kedro support Checkpointing/Caching of Results?
Let's say we have multiple long running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only these nodes must be executed again.
Does Kedro…

Sir ExecLP
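The checkpointing idea described in the question can be sketched in plain Python: save a step's result to disk and load it on later runs instead of recomputing. This is only an illustration of the concept, not a Kedro feature or API; the helper run_with_checkpoint, the function slow_step, and the file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def run_with_checkpoint(func: Callable[[], Any], checkpoint: Path) -> Any:
    """Run `func` unless its result was already saved at `checkpoint`."""
    if checkpoint.exists():
        # A saved result exists, so skip the computation and load it.
        with checkpoint.open("rb") as fh:
            return pickle.load(fh)
    result = func()
    with checkpoint.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how often the expensive step really runs


def slow_step() -> list:
    """Hypothetical stand-in for a long-running node."""
    calls.append("slow_step")
    return [x * x for x in range(5)]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "slow_step.pkl"
    first = run_with_checkpoint(slow_step, path)   # computes and saves
    second = run_with_checkpoint(slow_step, path)  # loads from disk

print(first == second, len(calls))  # True 1
```

A real setup would also need a way to invalidate the saved file when the step's code or inputs change, which this sketch leaves out.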
2
votes
2 answers
Passing nested parameters in the extra_params of the load_context in Kedro
I am trying to load a Kedro context with some extra parameters. My intention is to update the configs in parameters.yml with only the ones passed in extra_params (so the rest of the configs should remain the same). I will then use this instance of context…

Mohit
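The behaviour the question asks for, overriding only the nested keys that are passed while leaving the rest untouched, is a recursive dictionary merge. The sketch below shows that merge in plain Python. It is not Kedro's own parameter handling; the helper deep_merge and the example parameter names and values are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for the contents of parameters.yml.
file_params = {"model": {"learning_rate": 0.1, "epochs": 10}, "seed": 42}
# Hypothetical nested overrides, shaped like an extra_params dictionary.
extra_params = {"model": {"learning_rate": 0.01}}

merged_params = deep_merge(file_params, extra_params)
print(merged_params)
# {'model': {'learning_rate': 0.01, 'epochs': 10}, 'seed': 42}
```

Only model.learning_rate changes here; epochs and seed keep their original values, and file_params itself is not modified.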