1

I am trying to add layer attributes to my catalog. One common pattern I have is to get some data (raw), clean it up, and then output a list of parts (pri). I then need metadata for those parts, so I take the list of parts from pri and pass it into a function that fetches more data (raw). The pipeline itself is not circular, but Kedro does not seem to like it when I create circular layers.

Is there a common pattern that I am missing for this use case?

Would it be possible to allow layers to be circular?

Example

I have tried to put together a generic example below.


raw_truck_sales:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: raw

int_truck_sales:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: int

pri_truck_sales:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: pri

pri_truck_sold_models:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: pri

raw_truck_metadata:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: raw

int_truck_metadata:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: int

pri_truck_metadata:
  type: pandas.ParquetDataSet
  filepath: <filepath>
  layer: pri
from kedro.pipeline import node

nodes = [
    node(
        get_truck_sales,
        inputs=None,
        outputs='raw_truck_sales',
    ),
    node(
        create_int_truck_sales,
        inputs='raw_truck_sales',
        outputs='int_truck_sales',
    ),
    node(
        create_pri_truck_sales,
        inputs='int_truck_sales',
        outputs='pri_truck_sales',
    ),
    node(
        lambda truck_sales: truck_sales[['model']],
        inputs='pri_truck_sales',
        outputs='pri_truck_sold_models',
    ),

    # This node takes the list of trucks sold and gets metadata for them.
    # It seems to break Kedro's layer model by creating a circular reference
    # (a pri dataset feeding a raw dataset).
    node(
        get_truck_metadata,
        inputs='pri_truck_sold_models',
        outputs='raw_truck_metadata',
    ),
    node(
        create_int_truck_metadata,
        inputs='raw_truck_metadata',
        outputs='int_truck_metadata',
    ),
    node(
        create_pri_truck_metadata,
        inputs='int_truck_metadata',
        outputs='pri_truck_metadata',
    ),
]
Waylon Walker
  • I think the best thing to do in this case is to just remove the layer information altogether from that problematic dataset (`pri_truck_sold_models`). Viz is smart enough to visualise it in a logical place based on the node's topological order. We do raise an error for circular layers in viz because layers are linear by definition, at least visually. – Lim H. Sep 02 '20 at 15:30
  • Thanks for the feedback @LimH! That makes sense. I think layers are not going to work very well for us. We often work on a dataset that contains a subset of products (`trucks` in the example above), and we need to get a very small subset of data from a much larger dataset that might contain more than just that subset. Using the example above: in some extreme cases, getting all of the metadata literally takes days, while the `trucks` subset takes seconds. So we need to have some circularity. – Waylon Walker Sep 02 '20 at 18:05

2 Answers

2

Oh, hey Waylon! Haha.

Can you please post the entire stack trace that shows the error?

I've copied your pipeline, and it visualizes just fine for me, which means there are no circular dependencies. Perhaps there are other nodes that you have not listed here that are affecting your output?

EDIT: Lim Hoang just pointed out that your example has c_pro_truck_models_sold, and if that were pro_truck_models_sold, then that would be cyclic.

Lim and I agree that dropping the layers would be your best bet. The Kedro visualization isn't really hurt by the loss, anyway, as long as the surrounding nodes have their layers intact.

See the following image for proof.

[Image: dropped layer viz]
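
To make the cycle concrete, here is a minimal sketch in plain Python (no Kedro imports) that derives layer-to-layer edges from the example pipeline and checks them for a cycle. It is an illustration under assumptions, not Kedro-Viz's actual algorithm: `layer_edges` and `has_cycle` are hypothetical helpers, the dataset names follow the catalog in the question, and only direct hops between two datasets that both carry a layer are counted.

```python
# Hypothetical helpers for illustration; this is not Kedro or Kedro-Viz code.

# (input dataset, output dataset) for each node in the example that has an input.
EDGES = [
    ("raw_truck_sales", "int_truck_sales"),
    ("int_truck_sales", "pri_truck_sales"),
    ("pri_truck_sales", "pri_truck_sold_models"),
    ("pri_truck_sold_models", "raw_truck_metadata"),
    ("raw_truck_metadata", "int_truck_metadata"),
    ("int_truck_metadata", "pri_truck_metadata"),
]

# Layer attribute of each catalog entry.
LAYERS = {
    "raw_truck_sales": "raw",
    "int_truck_sales": "int",
    "pri_truck_sales": "pri",
    "pri_truck_sold_models": "pri",
    "raw_truck_metadata": "raw",
    "int_truck_metadata": "int",
    "pri_truck_metadata": "pri",
}


def layer_edges(edges, layers):
    """Collect layer -> layer edges between datasets that both have a layer."""
    graph = {}
    for src, dst in edges:
        a, b = layers.get(src), layers.get(dst)
        # Skip datasets without a layer and hops that stay inside one layer.
        if a is None or b is None or a == b:
            continue
        graph.setdefault(a, set()).add(b)
    return graph


def has_cycle(graph):
    """Depth-first search for a back edge in the layer graph."""
    visiting, done = set(), set()

    def visit(layer):
        if layer in done:
            return False
        if layer in visiting:
            return True
        visiting.add(layer)
        found = any(visit(nxt) for nxt in graph.get(layer, ()))
        visiting.discard(layer)
        done.add(layer)
        return found

    return any(visit(layer) for layer in list(graph))


with_layer = layer_edges(EDGES, LAYERS)
without_layer = layer_edges(
    EDGES, {k: v for k, v in LAYERS.items() if k != "pri_truck_sold_models"}
)

print(has_cycle(with_layer))     # True: raw -> int -> pri -> raw
print(has_cycle(without_layer))  # False: the pri -> raw edge is gone
```

In this simplified model, removing the layer from `pri_truck_sold_models` drops the only `pri` to `raw` edge, which is why dropping that one attribute is enough.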

tamsanh
  • Fixed the typo, thanks Tam. This now makes sense thanks to you and @LimH. I was hoping to get better alignment between layers, but I am ok not using layers. – Waylon Walker Sep 02 '20 at 18:11
1

The circular layer relationship you describe doesn't align with how data layers were originally designed, with restrictions on which layers feed to other layers:

| Layer        | Input Layer                                 | Output Layer                                           |
|--------------|---------------------------------------------|--------------------------------------------------------|
| Reference    |                                             | Primary, Feature, Model Input, Model Output, Reporting |
| Raw          |                                             | Intermediate, Primary                                  |
| Intermediate | Raw                                         | Primary                                                |
| Primary      | Raw, Intermediate, Reference                | Feature, Reporting                                     |
| Feature      | Primary, Reference                          | Model Input, Reporting                                 |
| Model Input  | Feature, Reference                          | Reporting                                              |
| Model Output | Model Input                                 | Reporting                                              |
| Reporting    | Primary, Feature, Model Input, Model Output |                                                        |

Kedro doesn't enforce this structure (or any particular set of layers), but it helps support it. Therefore, from a best practices perspective, circular dependencies between data layers should be avoided.
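
As a rough illustration of the table above, the allowed transitions can be written as a lookup and checked against a pipeline's dataset-to-dataset hops. This is a sketch under assumptions: `ALLOWED_OUTPUTS` transcribes only the "Output Layer" column, the short names `raw`, `int` and `pri` from the question are mapped to the table's names by hand, hops that stay within one layer are treated as allowed, and `violations` is a hypothetical helper, not part of Kedro.

```python
# Hypothetical check, not part of Kedro: which layers may feed which.
# Transcribed from the "Output Layer" column of the table above.
ALLOWED_OUTPUTS = {
    "reference": {"primary", "feature", "model_input", "model_output", "reporting"},
    "raw": {"intermediate", "primary"},
    "intermediate": {"primary"},
    "primary": {"feature", "reporting"},
    "feature": {"model_input", "reporting"},
    "model_input": {"reporting"},
    "model_output": {"reporting"},
    "reporting": set(),
}

# Assumed mapping from the question's short layer names to the table's names.
SHORT = {"raw": "raw", "int": "intermediate", "pri": "primary"}


def violations(hops):
    """Return the (source, target) layer pairs the table does not allow."""
    bad = []
    for src, dst in hops:
        a, b = SHORT.get(src, src), SHORT.get(dst, dst)
        # Assumption: hops within the same layer are fine.
        if a != b and b not in ALLOWED_OUTPUTS[a]:
            bad.append((src, dst))
    return bad


# Layer of the input and output dataset for each hop in the question's pipeline.
hops = [
    ("raw", "int"),
    ("int", "pri"),
    ("pri", "pri"),
    ("pri", "raw"),  # pri_truck_sold_models -> raw_truck_metadata
    ("raw", "int"),
    ("int", "pri"),
]

print(violations(hops))  # [('pri', 'raw')]
```

Only the hop from the primary list of models back into a raw dataset falls outside the table, which is the circular relationship described in the question.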

deepyaman