1

There is another question with a very similar title, but the answer there doesn't seem to apply to my problem.

I have the following dataframe:

                           date  hour  rating category  value
date_time                                                    
2020-02-01 00:00:00  2020-02-01     1    30.0       C1   41.5
2020-02-01 01:00:00  2020-02-01     2    32.4       C1   54.1
2020-02-01 02:00:00  2020-02-01     3    31.1       C1   40.9
2020-02-01 00:00:00  2020-02-01     1     NaN       C2   51.5
2020-02-01 01:00:00  2020-02-01     2     NaN       C2   50.6
2020-02-01 02:00:00  2020-02-01     3     NaN       C2   42.3
2020-02-02 00:00:00  2020-02-02     1    36.8       C1   47.8
2020-02-02 01:00:00  2020-02-02     2    30.1       C1   50.1
2020-02-02 02:00:00  2020-02-02     3    30.5       C1   44.0
2020-02-02 00:00:00  2020-02-02     1     NaN       C2   46.8
2020-02-02 01:00:00  2020-02-02     2     NaN       C2   44.2
2020-02-02 02:00:00  2020-02-02     3     NaN       C2   45.2

Which I would like to transform to this:

                           date  hour  rating  C1    C2   
date_time                                                    
2020-02-01 00:00:00  2020-02-01     1    30.0  41.5  51.5
2020-02-01 01:00:00  2020-02-01     2    32.4  54.1  50.6
2020-02-01 02:00:00  2020-02-01     3    31.1  40.9  42.3
2020-02-02 00:00:00  2020-02-02     1    36.8  47.8  46.8
2020-02-02 01:00:00  2020-02-02     2    30.1  50.1  44.2
2020-02-02 02:00:00  2020-02-02     3    30.5  44.0  45.2

So basically having the value for a certain category shown in a different column, as opposed to each row showing its category.

Based on the question I linked above, I came up with this:

pivoted_df = df.pivot_table(values="value", index=[df.index, "hour", "date", "rating"], columns="category")

But the result of that operation is:

category                                      C1
date_time           hour date       rating      
2020-02-01 00:00:00 1    2020-02-01 30.0    41.5
2020-02-01 01:00:00 2    2020-02-01 32.4    54.1
2020-02-01 02:00:00 3    2020-02-01 31.1    40.9
2020-02-02 00:00:00 1    2020-02-02 36.8    47.8
2020-02-02 01:00:00 2    2020-02-02 30.1    50.1
2020-02-02 02:00:00 3    2020-02-02 30.5    44.0

Which is really not what I'm looking for. And not only that, but I'd be adding hour, date and rating as indexes, which really doesn't look right as I'd like to have those as "normal" columns.

How can I achieve what I mentioned above? Thank you in advance!

Saucy Goat
  • 1,587
  • 1
  • 11
  • 32

1 Answers1

2

One way is to groupby first unstack on category, merge the multilevel columns and rename them.

 g=df.groupby(['date_time', 'date', 'hour','category']).first()\
.unstack('category').reset_index().dropna(axis='columns')

g.columns = [f'{a}{b}' for a, b in g.columns]#Collapse multiindex
g.rename(columns={'ratingC1':'rating', 'valueC1':'C1','valueC2':'C2'}, inplace=True)#Rename columns

        

      date_time                      date     hour rating    C1    C2
0  2020-02-01 00:00:00 00:00:00  2020-02-01     1    30.0  41.5  51.5
1  2020-02-01 01:00:00 01:00:00  2020-02-01     2    32.4  54.1  50.6
2  2020-02-01 02:00:00 02:00:00  2020-02-01     3    31.1  40.9  42.3
3  2020-02-02 00:00:00 00:00:00  2020-02-02     1    36.8  47.8  46.8
4  2020-02-02 01:00:00 01:00:00  2020-02-02     2    30.1  50.1  44.2
5  2020-02-02 02:00:00 02:00:00  2020-02-02     3    30.5  44.0  45.2

Alternatively, use pd.pivot

 import numpy as np
g=pd.pivot_table(df,index=["date_time","date","hour"],\
                 values=["value","rating"],columns=["category"]).reset_index()#Pivot
g.columns = [f'{a}{b}' for a, b in g.columns]#Collapse multiindex
g.rename(columns={'ratingC1':'rating', 'valueC1':'C1','valueC2':'C2'}, inplace=True)#Rename columns
print(g)
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • 2
    Many thanks, you have relieved me of a huge headache. One note though, I mentioned I'd like to keep `date_time` as the index of the dataframe. As such, all I had to do was add `pivoted_df.set_index("date_time", inplace=True)` to use the `date_time` column as an index. – Saucy Goat Aug 16 '20 at 14:23
  • @wwnde Instead of renaming the columns all over again. you can just use this g.columns = [f'{b}' for a, b in g.columns] – The AG Oct 20 '22 at 14:59