4

I have a pandas DataFrame like this:

df = pd.DataFrame({'custid':[1,2,3,4],
...: 'prod1':['jeans','tshirt','jacket','tshirt'],
...: 'prod1_hnode1':[1,2,3,2],
...: 'prod1_hnode2':[6,7,8,7],
...: 'prod2':['tshirt','jeans','jacket','shirt'],
...: 'prod2_hnode1':[2,1,3,4],
...: 'prod2_hnode2':[7,6,8,7]})

In [54]: df
Out[54]: 
    custid   prod1  prod1_hnode1  prod1_hnode2   prod2  prod2_hnode1  \
0       1   jeans             1             6  tshirt             2   
1       2  tshirt             2             7   jeans             1   
2       3  jacket             3             8  jacket             3   
3       4  tshirt             2             7   shirt             4   

   prod2_hnode2  
0             7  
1             6  
2             8  
3             7  

How can I convert this to the following format:

dfnew = pd.DataFrame({'custid':[1,1,2,2,3,3,4,4],
...: 'prod':['prod1','prod2','prod1','prod2','prod1','prod2','prod1','prod2'],
...: 'rec':['jeans','tshirt','tshirt','jeans','jacket','jacket','tshirt','shirt'],
...: 'hnode1':[1,2,2,1,3,3,2,4],
...: 'hnode2':[6,7,7,6,8,8,7,7]})


In [56]: dfnew
Out[56]: 
   custid  hnode1  hnode2   prod     rec
0       1       1       6  prod1   jeans
1       1       2       7  prod2  tshirt
2       2       2       7  prod1  tshirt
3       2       1       6  prod2   jeans
4       3       3       8  prod1  jacket
5       3       3       8  prod2  jacket
6       4       2       7  prod1  tshirt
7       4       4       7  prod2   shirt
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
tushaR
  • 3,083
  • 1
  • 20
  • 33
  • Your question is not clear enough. Have you tried [pivoting](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html) ? – Souradeep Nanda Jan 12 '18 at 10:07
  • 1
    @SouradeepNanda I think pivoting is the opposite of what I was trying to achieve. – tushaR Jan 12 '18 at 10:10

2 Answers2

5

Use:

  • set_index by column custid
  • create MultiIndex in columns by split
  • replace NaNs in columns by rec
  • stack by first level
  • reset_index for columns from MultiIndex
  • rename column

df = df.set_index('custid')
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns={np.nan:'rec'})
cols = ['custid','hnode1','hnode2','prod','rec']
df = df.stack(0).reset_index().rename(columns={'level_1':'prod'}).reindex(columns=cols)
print (df)
   custid  hnode1  hnode2   prod     rec
0       1       1       6  prod1   jeans
1       1       2       7  prod2  tshirt
2       2       2       7  prod1  tshirt
3       2       1       6  prod2   jeans
4       3       3       8  prod1  jacket
5       3       3       8  prod2  jacket
6       4       2       7  prod1  tshirt
7       4       4       7  prod2   shirt
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Here's another approach that should work, but uses repeated melts.

coln = df.dtypes.index  # save some typing
df_long = pd.melt(
    df, id_vars = "custid", value_vars = ["prod1", "prod2"],
    var_name = "prod", value_name = "rec").assign(
    hnode1 = pd.melt(df, id_vars = "custid", 
                     value_vars = filter(lambda x: "hnode1" in x, coln))["value"],
    hnode2 = pd.melt(df, id_vars = "custid", 
                     value_vars = filter(lambda x: "hnode2" in x, coln))["value"])
print(df_long)
   custid   prod     rec  hnode1  hnode2
0       1  prod1   jeans       1       6
1       2  prod1  tshirt       2       7
2       3  prod1  jacket       3       8
3       4  prod1  tshirt       2       7
4       1  prod2  tshirt       2       7
5       2  prod2   jeans       1       6
6       3  prod2  jacket       3       8
7       4  prod2   shirt       4       7

You mention R in your comments. melt from "data.table" should be able to handle this much more easily since you can melt multiple sets of columns at once, similar to how you might approach the problem with base R's reshape function.

The base R approach might be something along the lines of:

reshape(df, direction = "long", idvar = "custid", 
        varying = list(c(2, 5), c(3, 6), c(4, 7)), 
        sep = "", times = c("prod1", "prod2"))
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485