2

I am trying to use Pandas pivot_table to move line items on my receipts as columns, leaving only one row per receipt. My main challenge is to name my new columns counting numbers as needed and fill with NaNs.

I am trying to make this:

customer_id receipt_id item_id
01          100       420
05          400       450
05          400       460
05          400       320
05          400       270
02          300       320
02          300       460
05          200       220

...look like this:

customer_id receipt_id   1     2     3     4   n..
01          100          420   NaN   NaN   NaN
05          400          450   460   320   270
02          300          320   460   NaN   NaN
05          200          220   NaN   NaN   Nan

I have been trying to make it work, but I am not close.

pt = pd.pivot_table(df, values=["item_id"], index=["customer_id", "receipt_id"], columns=["item_id"], fill_value="NaN").reset_index()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
user6453877
  • 314
  • 1
  • 4
  • 14

2 Answers2

1

I think you need cumcount for creating column names and to remove [] from parameter values:

df['g'] = df.groupby(['customer_id','receipt_id'])['item_id'].cumcount() + 1

pt = pd.pivot_table(df, 
                    values="item_id",
                    index=["customer_id", "receipt_id"], 
                    columns="g", 
                    fill_value="NaN").reset_index()

print (pt)

g  customer_id  receipt_id    1    2    3    4
0            1         100  420  NaN  NaN  NaN
1            2         300  320  460  NaN  NaN
2            5         200  220  NaN  NaN  NaN
3            5         400  450  460  320  270
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you Jezrael! Your solution worked perfect. On 33M rows, with an average of 2 items per receipt it ran on 20 min using 58 GB RAM. Some receipts had up to 30 items. I dropped all numbered columns >8. – user6453877 Jun 23 '16 at 11:24
  • Glad can help you! Nice day! – jezrael Jun 23 '16 at 11:25
1

alternative solution (it might be slower - i didn't test timing):

In [243]: df.groupby(['customer_id','receipt_id'])['item_id'].apply(list).apply(pd.Series).reset_index()
Out[243]:
   customer_id  receipt_id      0      1      2      3
0            1         100  420.0    NaN    NaN    NaN
1            2         300  320.0  460.0    NaN    NaN
2            5         200  220.0    NaN    NaN    NaN
3            5         400  450.0  460.0  320.0  270.0
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419