I am trying to create features from a sample that looks like this:

index user product sub_product status
0 u1 p1 sp1 NA
1 u1 p1 sp2 NA
2 u1 p1 sp3 CANCELED
3 u1 p1 sp4 AVAIL
4 u2 p3 sp2 AVAIL
5 u2 p3 sp3 CANCELED
6 u2 p3 sp7 NA

First, I created dummies:

pd.get_dummies(x, columns=['product', 'sub_product', 'status'])

But I also need to aggregate the rows so that there is one row per user. What is the best way to do that?
If I just group it:

pd.get_dummies(x, columns=['product', 'sub_product', 'status']).groupby('user').max()
user product_p1 product_p3 sub_product_sp1 sub_product_sp2 sub_product_sp3 sub_product_sp4 sub_product_sp7 status_AVAIL status_CANCELED status_NA
u1 1 0 1 1 1 1 0 1 1 1
u2 0 1 0 1 1 0 1 1 1 1

I will lose information, for example that for u1 the status of sp3 is CANCELED. So it looks like I have to create dummies for every combination of column values?

spynal

1 Answer

Update: You are basically looking for pivot:

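# Cast everything to str so values such as a missing status (NaN) become ordinary labels.
out = (df.astype(str)
   .assign(value=1)
   # One row per user, one column per (product, sub_product, status) combination.
   .pivot_table(index=['user'], columns=['product','sub_product','status'],
                values='value', fill_value=0, aggfunc='max')
)

# Flatten the MultiIndex labels, e.g. ('p1', 'sp3', 'CANCELED') -> 'p1_sp3_CANCELED'.
out.columns = ['_'.join(x) for x in out.columns]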
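
As a rough sketch of how the same pivot idea could give columns named like sp3_status_CANCELED, the example below pivots on sub_product and status only. The DataFrame is rebuilt by hand from the table in the question, with "NA" assumed to be a literal string, and the `combo` name and the `_status_` naming scheme are only illustrative choices.

```python
import pandas as pd

# Sample rebuilt from the table in the question (assumption: "NA" is a literal string).
df = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u1", "u2", "u2", "u2"],
    "product": ["p1", "p1", "p1", "p1", "p3", "p3", "p3"],
    "sub_product": ["sp1", "sp2", "sp3", "sp4", "sp2", "sp3", "sp7"],
    "status": ["NA", "NA", "CANCELED", "AVAIL", "AVAIL", "CANCELED", "NA"],
})

# One indicator column per (sub_product, status) pair that occurs in the data,
# and one row per user.
combo = (df.assign(value=1)
           .pivot_table(index="user", columns=["sub_product", "status"],
                        values="value", fill_value=0, aggfunc="max"))

# Flatten the MultiIndex into names such as "sp3_status_CANCELED".
combo.columns = [f"{sp}_status_{st}" for sp, st in combo.columns]
print(combo)
```

Note that only pairs that actually occur in the data get a column; a pair that never appears, such as sp1 with AVAIL in this sample, is not created.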
Quang Hoang
  • I tried it and got the same result as I already have with pd.get_dummies(x, columns=['product', 'sub_product', 'status']).groupby('user').max(). I think I need columns like sp3_status_AVAIL, etc. – spynal May 12 '21 at 14:48
  • @spynal I see, I thought that was what you expected. So what is your expected output? It seems like a `pivot` question. – Quang Hoang May 12 '21 at 14:50
  • I think I also need columns like sp3_status_AVAIL, sp3_status_NA, etc., so basically every combination of the existing columns. – spynal May 12 '21 at 14:52