Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in data
. This is not because the kedro default is the right structure for them but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project rather than trying to shoehorn your datasets to fit the kedro default suggestion.
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, but without doing any complex calculations on it, just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. to find one-hot encoded most common car colour each day.
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2
from the two inputs feature_1
and feature_2
. There's no way of doing this without having multiple sub-layers within the feature layer.
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one which forms the basis for all feature engineering. If your data is in a shape that you can build features then that means it's probably primary layer already. In this case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
- The input datasets for a node in layer
L
should all be in the same layer, which can be either L
or L-1
- The output datasets for a node in layer
L
should all be in the same layer L
, which can be either L
or L+1
If anyone can offer any further advice or blogs\docs talking about Kedro Data Modelling that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.