Grouping several dataframe columns based on another columns values

Question

I have this dataframe:

    refid   col2    price1  factor1 price2  factor2 price3  factor3
  0   1       a      200       1      180     3     150       10
  1   2       b      500       1      450     3     400       10
  2   3       c      700       1      620     2     550        5

And I need to get this output:

   refid    col2    price   factor
0   1        a      200       1
1   1        b      500       1
2   1        c      700       1
3   2        a      180       3
4   2        b      450       3
5   2        c      620       2
6   3        a      150       10
7   3        b      400       10
8   3        c      550       5

Right now I'm trying to use df.melt method, but can't get it to work, this is the code and the current result:

df2_melt = df2.melt(id_vars=["refid","col2"],
        value_vars=["price1","price2","price3",
                   "factor1","factor2","factor3"],
        var_name="Price", 
        value_name="factor")



    refid   col2    price   factor
0       1   a      price1   200
1       2   b      price1   500
2       3   c      price1   700
3       1   a      price2   180
4       2   b      price2   450
5       3   c      price2   620
6       1   a      price3   150
7       2   b      price3   400
8       3   c      price3   550
9       1   a      factor1  1
10      2   b      factor1  1
11      3   c      factor1  1
12      1   a      factor2  3
13      2   b      factor2  3
14      3   c      factor2  2
15      1   a      factor3  10
16      2   b      factor3  10
17      3   c      factor3  5

I think you could use [`pd.wide_to_long`](https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html) — Corralien, Apr 26 '22 at 19:33

score 4 · Accepted Answer · answered Apr 26 '22 at 19:33

4

Since you have a wide DataFrame with common prefixes, you can use wide_to_long:

out = pd.wide_to_long(df, stubnames=['price','factor'], 
                      i=["refid","col2"], j='num').droplevel(-1).reset_index()

Output:

   refid col2  price  factor
0      1    a    200       1
1      1    a    180       3
2      1    a    150      10
3      2    b    500       1
4      2    b    450       3
5      2    b    400      10
6      3    c    700       1
7      3    c    620       2
8      3    c    550       5

Note that your expected output has an error where factors don't align with refids.

answered Apr 26 '22 at 19:33

To match the OP's output, sort by `refid` and `col2` ;) – Apr 26 '22 at 19:38
@enke I'm using your solution with my real dataframe which is a bit more complicated than the example one, and I'm getting the price value as NaN in all the entries. Do you know what might be the problem? (the factor values are ok) – imatiasmb Apr 26 '22 at 21:58
1

@imatiasmb One possibility is that price column names don't have integer suffixes like `price1` but rather `price_1` or `priceA` etc. In `wide_to_long`, the suffixes are assumed to be numeric by default, so this could result in a NaN column (it doesn't have to be numeric; using regex, it can be anything you want). If this is the case, since factor columns are OK, it may be easier if you format the price columns to have the same suffixes as the factor columns. – Apr 26 '22 at 22:55
1

@enke thank you! in fact was a suffixes issue!! – imatiasmb Apr 27 '22 at 14:50

AlirezaAsadi · Answer 2 · 2022-04-26T19:38:13.637

You can melt two times and then concat them:

import pandas as pd  

df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3,3,2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})
prices = [c for c in df if c.startswith('price')]
factors = [c for c in df if c.startswith('factor')]
df1 = pd.melt(df, id_vars=["refid","col2"], value_vars=prices, value_name='price').drop('variable', axis=1)
df2 = pd.melt(df, id_vars=["refid","col2"], value_vars=factors, value_name='factor').drop('variable', axis=1)
df3 = pd.concat([df1, df2['factor']],axis=1).reset_index().drop('index', axis=1)
print(df3)

Here is the output:

     refid  col2  price  factor
0      1    a    200       1
1      2    b    500       1
2      3    c    700       1
3      1    a    180       3
4      2    b    450       3
5      3    c    620       2
6      1    a    150      10
7      2    b    400      10
8      3    c    550       5

Thank you. Nice solution. One'd have to merge both df1 and df2 to attain my desired solution. — imatiasmb, Apr 26 '22 at 19:41

score 1 · Answer 3 · answered Apr 26 '22 at 23:57

One option is pivot_longer from pyjanitor:

# pip install pyjanitor
import janitor
import pandas as pd

(df
.pivot_longer(
    index = ['refid', 'col2'], 
    names_to = '.value', 
    names_pattern = r"(.+)\d", 
    sort_by_appearance = True)
)
   refid col2  price  factor
0      1    a    200       1
1      1    a    180       3
2      1    a    150      10
3      2    b    500       1
4      2    b    450       3
5      2    b    400      10
6      3    c    700       1
7      3    c    620       2
8      3    c    550       5

The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.

Grouping several dataframe columns based on another columns values

3 Answers3