2

I have a dataframe with the following simplified structure:

import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'name': ["a", "b", "c", "d", "e"],
     'country': ["uk", "spain", "france", "germany", "italy"],
     'cat_01_2020': [10, 20, 30, 40, 50],
     'cat_01_2019': [11, 21, 31, 41, 51],
     'cat_01_2018': [12, 22, 32, 42, 52],
     'cat_02_2020': [100, 200, 300, 400, 500],
     'cat_02_2019': [111, 211, 311, 411, 511],
     'cat_02_2018': [122, 222, 322, 422, 522],
     'cat_03_2020': [1000, 2000, 3000, 4000, 5000],
     'cat_03_2019': [1111, 2111, 3111, 4111, 5111],
     'cat_03_2018': [1222, 2222, 3222, 4222, 5222]}

df = pd.DataFrame(data = d)

df_image

and I want to get this new df_target.

d_target = {'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
     'name': ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "e", "e", "e"],
     'country': ["uk", "uk", "uk", "spain", "spain", "spain", "france", "france", "france", "germany", "germany", "germany", "italy", "italy", "italy"],
     'year': [2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018],
     'cat_01': [10, 11, 12, 20, 21, 22, 30, 31, 32, 40, 41, 42, 50, 51, 52],
     'cat_02': [100, 111, 122, 200, 211, 222, 300, 311, 322, 400, 411, 422, 500, 511, 522],
     'cat_03': [1000, 1111, 1222, 2000, 2111, 2222, 3000, 3111, 3222, 4000, 4111, 4222, 5000, 5111, 5222]}

df_target = pd.DataFrame(data = d_target)

df_target

df_target_image

To do this, I think I need to use the pandas crosstab function, first getting the years 2018, 2019 and 2020. Then I should be able to get cat_01, cat_02 and cat_03.

Does anyone know how I can do this?

Thank you very much in advance.

Best regards.

Gusti Adli
  • 1,225
  • 4
  • 13

1 Answers1

3

What you are looking for is probably wide_to_long:

pd.wide_to_long(df, 
                stubnames = ['cat_01', 'cat_02', 'cat_03'], 
                i = ['id', 'name', 'country'], 
                j = 'year', 
                sep = '_', 
                suffix = r"\d+").reset_index()
 
    id name  country  year  cat_01  cat_02  cat_03
0    1    a       uk  2020      10     100    1000
1    1    a       uk  2019      11     111    1111
2    1    a       uk  2018      12     122    1222
3    2    b    spain  2020      20     200    2000
4    2    b    spain  2019      21     211    2111
5    2    b    spain  2018      22     222    2222
6    3    c   france  2020      30     300    3000
7    3    c   france  2019      31     311    3111
8    3    c   france  2018      32     322    3222
9    4    d  germany  2020      40     400    4000
10   4    d  germany  2019      41     411    4111
11   4    d  germany  2018      42     422    4222
12   5    e    italy  2020      50     500    5000
13   5    e    italy  2019      51     511    5111
14   5    e    italy  2018      52     522    5222

Alternatively, you could use pivot_longer from pyjanitor:

#pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index = ['id', 'name',  'country'], 
                names_to = (".value", "year"), 
                names_pattern = r"(.+)_(\d+)$", 
                sort_by_appearance = True)

    id name  country  year  cat_01  cat_02  cat_03
0    1    a       uk  2020      10     100    1000
1    1    a       uk  2019      11     111    1111
2    1    a       uk  2018      12     122    1222
3    2    b    spain  2020      20     200    2000
4    2    b    spain  2019      21     211    2111
5    2    b    spain  2018      22     222    2222
6    3    c   france  2020      30     300    3000
7    3    c   france  2019      31     311    3111
8    3    c   france  2018      32     322    3222
9    4    d  germany  2020      40     400    4000
10   4    d  germany  2019      41     411    4111
11   4    d  germany  2018      42     422    4222
12   5    e    italy  2020      50     500    5000
13   5    e    italy  2019      51     511    5111
14   5    e    italy  2018      52     522    5222

The .value in names_to retains any part of the column names associated with it (in this case the cat.* prefixes) as column header(s), while the rest goes into the year column. This split is determined by the groups in the names_pattern argument.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • `pyjanitor` seems cool :) Are you the author/contributor of this library? – Shubham Sharma May 24 '21 at 10:53
  • @ShubhamSharma - [Eric Ma](https://github.com/ericmjl) is the author ... i am just a humble contributor ... I like the idea behind the library. In the end, it is just a helper for Pandas :). Would be great if you could contribute as well ... your skills would be most welcome (and I'd learn a little more from you !) – sammywemmy May 24 '21 at 10:58
  • 1
    Oh wow! It seems very useful library. Been busy these days but i'll be glad if could contribute to it as well. I'll check the github repo soon. Thanks (we all learn from each other) – Shubham Sharma May 24 '21 at 11:10