
I am setting up a config.yaml file for use in a .ipynb notebook file which pre-processes some data frames.

Currently I have a setup like this in my .yaml file:

          inputs:
            path:
              - 'data/raw/some_path.csv'
            columns:
              Product Description: 'product_desc'
              Product: 'product_code'
              Customer: 'retailer'
              Brand Name: 'brand'
              Order Date: 'date'
            dtypes:
              product_desc: 'str'
              product_code: 'int'
              retailer: 'str'
              brand: 'str'
              date: 'datetime'
            entries:
              retailer:
                COMPANY 123: 'company_123'
                COMPANY 456: 'company_456'
              brand:
                BRAND 123: 'brand_123'
                BRAND 456: 'brand_456'

It is set up so that I can write very concise code in my .ipynb file like this:

          pd.read_csv(inputs['path']).rename(columns=inputs['columns']).astype(inputs['dtypes']).replace(inputs['entries'])

But I am trying to follow best practices here, and my worry is that the .yaml file is quite repetitive. For example, 'retailer' and 'brand' each appear 3 times, so if I ever wanted to change the column name to 'Retailer' instead, I'd have to change all 3 occurrences in the .yaml file. The same is true for every other column name that appears more than once in the .yaml file.
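To make the risk concrete, here is a minimal sketch of a consistency check, assuming `inputs` is the plain dict that the YAML above loads to. The function name `find_inconsistent_names` is hypothetical, and the example config deliberately renames 'retailer' to 'Retailer' in only one of the three places.

```python
def find_inconsistent_names(inputs):
    """Return names used under 'dtypes' or 'entries' that are not
    rename targets under 'columns' (hypothetical helper)."""
    renamed = set(inputs["columns"].values())
    used = set(inputs["dtypes"]) | set(inputs["entries"])
    return sorted(used - renamed)


# Same structure as the YAML above, but 'Retailer' was changed
# only under 'columns' and forgotten under 'dtypes' and 'entries'.
inputs = {
    "columns": {
        "Product Description": "product_desc",
        "Product": "product_code",
        "Customer": "Retailer",
        "Brand Name": "brand",
        "Order Date": "date",
    },
    "dtypes": {
        "product_desc": "str",
        "product_code": "int",
        "retailer": "str",
        "brand": "str",
        "date": "datetime",
    },
    "entries": {
        "retailer": {"COMPANY 123": "company_123", "COMPANY 456": "company_456"},
        "brand": {"BRAND 123": "brand_123", "BRAND 456": "brand_456"},
    },
}

stale = find_inconsistent_names(inputs)
print(stale)  # ['retailer']
```

A check like this only detects the mismatch; it does not remove the repetition itself.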

But isn't part of the point of a config.yaml file that I can change one word and have the script run differently? In that case, how should I structure the .yaml file to avoid this repetition without sacrificing the readability and conciseness of the Python code that reads and uses it?

Elis
  • You are using mappings, for which the order is not guaranteed, and the repetition of keys is necessary to make an explicit connection between the columns, dtypes and entries. If you switch to using sequences, you can use corresponding positions in the sequence to implicitly make the connection. – Anthon Jun 30 '23 at 12:29
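
One way to act on this suggestion is sketched below. It is a variant of the idea rather than a literal implementation: instead of parallel sequences matched by position, each column is one record in a single sequence, so every name is written once. The record keys (`source`, `name`, `dtype`, `entries`) and the helper `build_inputs` are assumptions made for illustration, and the list stands in for what loading such a YAML file would return.

```python
# Hypothetical layout: each column is described once, as one record.
# This list stands in for the parsed content of a reorganised config.yaml.
columns = [
    {"source": "Product Description", "name": "product_desc", "dtype": "str"},
    {"source": "Product", "name": "product_code", "dtype": "int"},
    {"source": "Customer", "name": "retailer", "dtype": "str",
     "entries": {"COMPANY 123": "company_123", "COMPANY 456": "company_456"}},
    {"source": "Brand Name", "name": "brand", "dtype": "str",
     "entries": {"BRAND 123": "brand_123", "BRAND 456": "brand_456"}},
    {"source": "Order Date", "name": "date", "dtype": "datetime"},
]


def build_inputs(columns):
    """Derive the three mappings that rename/astype/replace expect."""
    return {
        "columns": {c["source"]: c["name"] for c in columns},
        "dtypes": {c["name"]: c["dtype"] for c in columns},
        "entries": {c["name"]: c["entries"] for c in columns if "entries" in c},
    }


derived = build_inputs(columns)
```

Because `derived` has the same `columns`, `dtypes` and `entries` keys as the original config, the one-line pandas chain in the question would not need to change once `path` is added alongside them. Renaming 'retailer' to 'Retailer' then means editing a single `name` value.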
