7

I have a huge pandas dataframe, shaped like this example:

 new_id hour names values
    0   0   mark    5
    0   0   matt    4
    0   0   alex    3
    1   0   roger   2
    1   0   arthur  7
    1   1   alf     8
    2   1   ale     6
    3   1   peter   5
    3   2   tom     2
    4   2   andrew  7

I need to reshape it, so I use pivot_table():

dummy=dummy.pivot_table(index=['hour','new_id'],columns='name', values='values').fillna(0)

so it becomes

        names   ale alex alf andrew arthur mark matt peter roger tom
hour    new_id                                      
0           0   0.0 3.0 0.0 0.0 0.0 5.0 4.0 0.0 0.0 0.0
            1   0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 2.0 0.0
1           1   0.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
            2   6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
            3   0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0
2           3   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
            4   0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0
....

By the way, this small example can't reproduce my problem: in the real dataset, when I pivot it, I get some float values which shouldn't be there, since they are the aggregation and the sum of the values of the initial dataset, which are all integers. Not only they are float, but they are also quite far from the exact results.

Why do I get these float values? Is there a better way to get what I want? I don't really want to write by myself a function to sum properly all the values before pivoting the dataframe, since this should be exactly what pivot_table()does.

sato
  • 768
  • 1
  • 9
  • 30

1 Answers1

11

There is problem NaNs, which convert all values to floats so possible solution is add parameter fill_value=0 if input data are integers:

dummy=dummy.pivot_table(index=['hour','new_id'],columns='name', values='values', fill_value=0)
print (dummy)
name         ale  alex  alf  andrew  arthur  mark  matt  peter  roger  tom
hour new_id                                                               
0    0         0     3    0       0       0     5     4      0      0    0
     1         0     0    0       0       7     0     0      0      2    0
1    1         0     0    8       0       0     0     0      0      0    0
     2         6     0    0       0       0     0     0      0      0    0
     3         0     0    0       0       0     0     0      5      0    0
2    3         0     0    0       0       0     0     0      0      0    2
     4         0     0    0       7       0     0     0      0      0    0

Default aggregate function in pivot_table is mean, so is expected at least one float value in output, so it convert all values to floats.

So if change aggregate function to sum all working nice:

dummy = dummy.pivot_table(index=['hour','new_id'],
                          columns='name', 
                          values='values', 
                          fill_value=0, 
                          aggfunc='sum')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Strangely I get integers on the example dataset but I still get floats on the whole dataframe, don't know why. – sato May 15 '18 at 14:02
  • @sato - what is `df.dtypes` of input data? – jezrael May 15 '18 at 14:03
  • For the 'values' column it's int64 – sato May 15 '18 at 14:05
  • I know, and I really have no idea on how to work around this issue. The only thing I know is that it's giving me completely wrong results. Could it be a good idea to compute the sums before pivoting the dataframe? If so, how could I do this? – sato May 15 '18 at 14:12
  • Plus: should I open an issue on pandas github? – sato May 15 '18 at 14:13
  • @sato - hmmm, are data condidental? Because maybe data dependent issue... – jezrael May 15 '18 at 14:15
  • Yes, unfortunately I cannot share data. Can you suggest some reason for data to raise issues? What should I look for? – sato May 15 '18 at 14:21
  • hmm, hard question, if no float column. What about `pivot_table` alternative? `dummy.groupby(['hour','new_id','name'])['values'].mean().unstack(fill_value=0)` – jezrael May 15 '18 at 14:23
  • Exactly the same: still floats, same results – sato May 15 '18 at 14:25
  • @sato - I know what is problem :) You aggregtae by mean, not by sum, so mean are floats :) – jezrael May 15 '18 at 14:28
  • @sato - so need `dummy=dummy.pivot_table(index=['hour','new_id'],columns='name', values='values', fill_value=0, aggfunc='sum') print (dummy)` or `dummy.groupby(['hour','new_id','name'])['values'].sum().unstack(fill_value=0)` – jezrael May 15 '18 at 14:29
  • jeeeez, you fooled me with that mean() in the groupby expression :D I copy-pasted quickly and I totally missed it. You're great :D – sato May 15 '18 at 14:32
  • 1
    @sato - I can get is earlier :) Super, I was really surprised why get it :) – jezrael May 15 '18 at 14:33
  • I really thought that `pivot_table()` used the sum as default function. Write your comment as an answer so I can accept it :) – sato May 15 '18 at 14:33