1

I have the following dataframe:

index   id    code    data    date
0      AZ234  B213    apple   2020-09-01 <- duplicate id, code, data
1      AZ234  B213    apple   2022-02-02 <- duplicate id, code, data
2      AZ234  B213    banana  2020-07-01
3      AZ234  B213    orange  2020-05-11
4      AL612  B309    apple   2020-12-05
5      AL612  B309    banana  2020-07-21
6      AL612  B309    orange  2020-09-21

...

I want to create pivot table to get the following table:

id    code    apple         banana        orange
AZ234  B213   2020-09-01    2020-07-01     2020-05-11
AL612  B309   2020-12-05    2020-07-21     2020-09-21
...

I have tried to do this using pivot_table (pandas):

pd.pivot_table(df, values='date', index=['id','code'],
                       columns=['data'])

but I get this error:

DataError: No numeric types to aggregate

I have read this post but it seems to be a bit different as I don't want to change the columns and also I got error when I tried to set_index with code and id ( " ValueError: Index contains duplicate entries, cannot reshape").

My goal is to create pivot table with dates as values of the table.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Reut
  • 1,555
  • 4
  • 23
  • 55

1 Answers1

1

There are duplicates per id, date, data so is necessary add some aggregate function:

If there are datetimes:

df['date'] = pd.to_datetime(df['date'])

df.pivot_table(values='date', index=['id','code'], columns=['data'], aggfunc='first')

df.pivot_table(values='date', index=['id','code'], columns=['data'], aggfunc='max')

If there are strings:

print (df['date'].dtype)

df.pivot_table(values='date', index=['id','code'], columns=['data'], aggfunc=','.join)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thank you for the super quick answer! I am reading about the aggfunc and not sure how it influence the reults. Can you please elaborate about it? – Reut Mar 16 '22 at 09:54
  • @Reut - added new row with same `id, date, date` in sample data, what is expected ouput? – jezrael Mar 16 '22 at 09:56
  • @excpected data is like appears in the original post ,second table – Reut Mar 16 '22 at 10:05
  • @Reut - so `aggfunc='first'` working well? – jezrael Mar 16 '22 at 10:05
  • yes but confused about what does it do – Reut Mar 16 '22 at 10:11
  • @Reut - I EDIT data in question for generate error if use `pivot` instead `pivot_table`. So ``df.pivot(values='date', index=['id','code'], columns=['data'])` generate error - `ValueError: Index contains duplicate entries`. But if use `pivot_table` there is default aggregate function `mean`, so generate error `DataError: No numeric types to aggregate`, because no number in `Date` column but datetimes. So is necessary change aggregate function - I add possible solutions for get ouput - if duplicates like `AZ234 B213 apple` then date is `first`? Or `max` ? Or `min` ? Or `last` ? – jezrael Mar 16 '22 at 10:16
  • the thing is that there are duplicates because is different fruit in this case. And I don't want to take the max value ot else, I want to keep the id and code and get all the fruits dates for it – Reut Mar 16 '22 at 10:20
  • @Reut - Do you think duplicates per `duplicate id, code, data` ? and different `date`s ? What return `df = df[df.duplicated(subset=['id','code','data'], keep=False)]` ? It get all duplicated rows whcih generate error `alueError: Index contains duplicate entries` – jezrael Mar 16 '22 at 10:23
  • 1
    I just was trying to understand the aggfunc utility, Because as you see, on the original table. I have same id and code repeat so I don'tunderstand in the aggregation what does max or first do – Reut Mar 16 '22 at 12:36