
I am looking at the Kedro library, as my team is considering using it for our data pipeline.

While going through the official Spaceflights tutorial, I came across this function:

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the data for companies.

    Args:
        companies: Source data.
    Returns:
        Preprocessed data.
    """
    companies["iata_approved"] = companies["iata_approved"].apply(_is_true)
    companies["company_rating"] = companies["company_rating"].apply(_parse_percentage)
    return companies
  • companies is the name of the CSV file containing the data

Looking at the function, my assumption is that (companies: pd.DataFrame) is shorthand to read the "companies" dataset as a DataFrame. If so, I do not understand what -> pd.DataFrame at the end means.

I tried looking at the Python documentation for this style of code, but I did not manage to find anything.

Any help in understanding this would be much appreciated.

Thank you

3 Answers


This is the way of declaring the types of your inputs. In (companies: pd.DataFrame), companies is the argument and pd.DataFrame is its type. In the same way, -> pd.DataFrame is the type of the output. Overall, the signature says that the function takes companies of type pd.DataFrame and returns a pd.DataFrame. I hope that helps.
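
For illustration, here is a minimal sketch of the same annotation syntax on a hypothetical function (count_rows is not from Kedro, just an example):

import pandas as pd

# "df: pd.DataFrame" declares the parameter type; "-> int" declares the return type.
def count_rows(df: pd.DataFrame) -> int:
    return len(df)

# The annotations are stored on the function object but not enforced by Python itself:
print(count_rows.__annotations__)
# {'df': <class 'pandas.core.frame.DataFrame'>, 'return': <class 'int'>}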


The -> notation is type hinting, as is the : part in the companies: pd.DataFrame function definition. This is not essential to do in Python but many people like to include it. The function definition would work exactly the same if it didn't contain this but instead read:

def preprocess_companies(companies):

This is a general Python thing rather than anything kedro-specific.
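
To see that the hints are optional, note that Python does not enforce them at runtime. In this hypothetical snippet, a type checker such as mypy would flag the mismatch, but the call still runs:

def double(x: int) -> int:
    return x * 2

# Python ignores the hints at runtime, so this runs without error,
# even though a static type checker would report it:
print(double("ab"))  # prints "abab"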

The way that kedro registers companies as a kedro dataset is completely separate from this function definition and is done through the catalog.yml file:

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
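
If it helps to see the mechanics, here is a rough sketch of what that entry amounts to programmatically. This assumes a Kedro version where the pandas CSVDataSet lives under kedro.extras.datasets; the import path varies across releases:

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

# Roughly what the catalog.yml entry above declares:
catalog = DataCatalog(
    {"companies": CSVDataSet(filepath="data/01_raw/companies.csv")}
)

companies = catalog.load("companies")  # loads the CSV as a pandas DataFrame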

There will then be a node defined (in pipeline.py) to specify that the preprocess_companies function should take as its input the kedro dataset companies:

node(
    func=preprocess_companies,
    inputs="companies",  # THIS LINE REFERS TO THE DATASET NAME
    outputs="preprocessed_companies",
    name="preprocessing_companies",
),
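
For context, that node typically sits inside a create_pipeline function in pipeline.py, along the lines of this sketch (the surrounding structure is from memory of the tutorial, so treat it as an approximation):

from kedro.pipeline import Pipeline, node

from .nodes import preprocess_companies

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",  # the dataset name from catalog.yml
                outputs="preprocessed_companies",
                name="preprocessing_companies",
            ),
        ]
    )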

In theory the name of the parameter in the function itself could be completely different, e.g.

def preprocess_companies(anything_you_want):

... although it is very common to give it the same name as the dataset.

Antony Milne

In this situation companies is technically any DataFrame. However, when wrapped in a Kedro Node object the correct dataset will be passed in:

Node(
    func=preprocess_companies,      # The function posted above
    inputs='raw_companies',         # Kedro will read from a catalog entry called 'raw_companies'
    outputs='processed_companies',  # Kedro will write to a catalog entry called 'processed_companies'
)
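
Because it is a plain Python function, you could also call it directly with any DataFrame you like. This hypothetical snippet assumes the tutorial's _is_true and _parse_percentage helpers are in scope:

import pandas as pd

# Any DataFrame with the expected columns works; Kedro is not involved here.
df = pd.DataFrame(
    {"iata_approved": ["t", "f"], "company_rating": ["90%", "80%"]}
)
print(preprocess_companies(df))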

In essence, the parameter name isn't really important here. It has been named this way so that the person reading the code knows that it is semantically about companies, but the function name does that too.

The above is technically a simplification, since I'm not getting into MemoryDataSets, but hopefully it covers the main points.

datajoely