0

My data is in wide format as such:

data = [{
id: '1',
'timestamp': '2021-10-01',
'product_type' : 'Quarterly',
'applicable_tariff_per_local_currency_kwh_d_value' : 1.11,
'applicable_tariff_per_local_currency_kwh_d_unit' : 'CZK/(kWh/d)/q',

'applicable_tariff_per_local_currency_kwh_h_value' : 11.11,
'applicable_tariff_per_local_currency_kwh_h_unit' : 'CZK/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_h_value' : 0.1111,
'applicable_tariff_per_eur_kwh_h_unit' : 'Euro/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_d_value' : 0.0011,
'applicable_tariff_per_eur_kwh_d_unit' : 'Euro/(kWh/d)/q',

'applicable_tariff_in_common_unit_value' : 0.0001111,
'applicable_tariff_in_common_unit_unit' : 'Euro/(kWh/h)/d'
},
{
'id': '2',
'timestamp': '2021-10-01',
'product_type' : 'Quarterly',
'applicable_tariff_per_local_currency_kwh_d_value' : 2.22,
'applicable_tariff_per_local_currency_kwh_d_unit' : 'CZK/(kWh/d)/q',

'applicable_tariff_per_local_currency_kwh_h_value' : 22.22,
'applicable_tariff_per_local_currency_kwh_h_unit' : 'CZK/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_h_value' : 0.2222,
'applicable_tariff_per_eur_kwh_h_unit' : 'Euro/(kWh/h)/q',

'applicable_tariff_per_eur_kwh_d_value' : 0.0022,
'applicable_tariff_per_eur_kwh_d_unit' : 'Euro/(kWh/d)/q',

'applicable_tariff_in_common_unit_value' : 0.0002222,
'applicable_tariff_in_common_unit_unit' : 'Euro/(kWh/h)/d'
}
]

df = pd.DataFrame(data)

I would like to pivot longer this data into the following specified format:

pivot_data = [{
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'CZK', # Can be obtained through regex on code
'unit': 'kWh/h', # Can be obtained through regex on code
'product_code': 'q', # Can be obtained through regex on code
'code': 'CZK/(kWh/d)/q',
'value': 1.11
},
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'CZK', 
'unit': 'kWh/h',
'product_code': 'q', 
'code': 'CZK/(kWh/h)/q',
'value': 11.11
},
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'Euro',
'unit': 'kWh/h', 
'product_code': 'q', 
'code': 'Euro/(kWh/h)/q',
'value': 0.1111
},
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'Quarterly',
'currency': 'Euro', 
'unit': 'kWh/d', 
'product_code': 'q', 
'code': 'Euro/(kWh/d)/q',
'value': 0.001111
},
# This one is extra and not super neccesary...
'id': '1',
'timestamp': '2021-10-01',
'product_type': 'common_unit',
'currency': 'Euro', 
'unit': 'kWh/h', 
'product_code': 'd', 
'code': 'Euro/(kWh/d)/d',
'value': 0.001111
}]

df_pivoted = pd.DataFrame(pivot_data)

I know I can do this with pd.pivot_wider_to_longer() or pandas.merge, however I cannot seem to get it to work. Is anyone able to shed more light on this? I've added this example as it was the data I am struggling with, however if you have a more simple dataset you could show me an example of this as well!

I did not add id 2 as it would be too much data in the pivoted_longer dataset.

Thanks in advance.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214

2 Answers2

0

Because you require two related columns (value, unit) in the output, you need to run melt twice:

In     : d1 = pd.melt(
    ...:     df,
    ...:     id_vars=["id"],
    ...:     value_vars=[
    ...:         "applicable_tariff_per_local_currency_kwh_d_value",
    ...:         "applicable_tariff_per_local_currency_kwh_h_value",
    ...:         "applicable_tariff_per_eur_kwh_h_value",
    ...:         "applicable_tariff_per_eur_kwh_d_value",
    ...:         "applicable_tariff_in_common_unit_value",
    ...:     ],
    ...: )

In     : d1.head(n=2)
Out    :
  id                                          variable  value
0  1  applicable_tariff_per_local_currency_kwh_d_value   1.11
1  2  applicable_tariff_per_local_currency_kwh_d_value   2.22

and same thing for for the unit columns:

In     : d2 = pd.melt(
    ...:     df,
    ...:     id_vars=["id"],
    ...:     value_vars=[
    ...:         "applicable_tariff_per_local_currency_kwh_d_unit",
    ...:         "applicable_tariff_per_local_currency_kwh_h_unit",
    ...:         "applicable_tariff_per_eur_kwh_h_unit",
    ...:         "applicable_tariff_per_eur_kwh_d_unit",
    ...:         "applicable_tariff_in_common_unit_unit",
    ...:     ],
    ...: )

You can now merge these two dataframes to get to your desired result:

d1["variable"]= d1["variable"].str.replace("_value", "")
d2["variable"] = d2["variable"].str.replace("_unit$", "", regex=True)
d2 = d2.rename(columns={"value": "code"})
df_pivoted = d1.merge(d2, on=["id", "variable"]).drop(columns=["variable"])

For the rest you can use pandas string operations to split out the code columns and you can merge with the original dataframe to get the timestamp and product_type info back but this should be clear enough.

nick
  • 1,310
  • 8
  • 15
  • 1
    Wow, I will try this out. Thank you so much! – VolatilitySmile Feb 22 '22 at 10:53
  • Just saw there was a slight bug in the replace methods. I over complicated it so there might have been an issue when things were merged... I edited the answer to fix that. – nick Feb 22 '22 at 11:06
0

One option is with pivot_longer from pyjanitor:

# pip install pyjanitor

import janitor 

(df
.pivot_longer(
    index = slice('id', 'product_type'), 
    names_to = '.value', 
    names_pattern = '.+_(.+)')
.assign(
    unit_extract = lambda df: df.unit.str.extract(r".+/\((.+)\)/.+"), 
    product_code = lambda df: df.unit.str[-1], 
    currency = lambda df: df.unit.str.extract(r"(.+?)/"))
)
  id   timestamp product_type      value            unit unit_extract product_code currency
0  1  2021-10-01    Quarterly   1.110000   CZK/(kWh/d)/q        kWh/d            q      CZK
1  2  2021-10-01    Quarterly   2.220000   CZK/(kWh/d)/q        kWh/d            q      CZK
2  1  2021-10-01    Quarterly  11.110000   CZK/(kWh/h)/q        kWh/h            q      CZK
3  2  2021-10-01    Quarterly  22.220000   CZK/(kWh/h)/q        kWh/h            q      CZK
4  1  2021-10-01    Quarterly   0.111100  Euro/(kWh/h)/q        kWh/h            q     Euro
5  2  2021-10-01    Quarterly   0.222200  Euro/(kWh/h)/q        kWh/h            q     Euro
6  1  2021-10-01    Quarterly   0.001100  Euro/(kWh/d)/q        kWh/d            q     Euro
7  2  2021-10-01    Quarterly   0.002200  Euro/(kWh/d)/q        kWh/d            q     Euro
8  1  2021-10-01    Quarterly   0.000111  Euro/(kWh/h)/d        kWh/h            d     Euro
9  2  2021-10-01    Quarterly   0.000222  Euro/(kWh/h)/d        kWh/h            d     Euro

In the above solution, the .value determines which parts of the column labels remain as headers - the labels are split apart with the names_pattern. The rest is just string manipulation with Pandas' string methods.

halfer
  • 19,824
  • 17
  • 99
  • 186
sammywemmy
  • 27,093
  • 4
  • 17
  • 31