0

I have a dataframe like below:

enter image description here

data = {'ID': [1,2,3,4,5],
        'NODE_ID': [10,10,20,15,20],
        'TYPE': ['a','a','b','a','b'],
        'DATE': ['2021-12-02','2021-12-02','2021-12-02','2021-12-03','2021-12-02'],
        'HOUR': [0,0,3,2,3],
        'EVENTS_COUNT': [10,15,10,21,12]
        
        }

df = pd.DataFrame(data,columns=['ID','NODE_ID', 'TYPE', 'DATE', 'HOUR', 'EVENTS_COUNT'])

I have two different TYPE - a and b. I want to create a dataframe out of this so I have a sum of each TYPE (a and b) for the group of NODE_ID, DATE, HOUR.

The expected output is

enter image description here

Can you please suggest how to do this?

EDIT: Updated the expected output.

nad
  • 2,640
  • 11
  • 55
  • 96
  • 4
    I'm not sure your expected output matches your question? If I group by `NODE_ID`, `DATE` and `HOUR` then shouldn't rows 1 and 3 be in the same group? – tomjn Dec 15 '21 at 18:15
  • To add the above, the EVENTS_COUNT_b in row 1 be 22 and you shouldn't have the last row. – not_speshal Dec 15 '21 at 18:20
  • This is essentially pivot_table: `df.pivot_table(index=['NODE_ID','DATE','HOUR'],columns='TYPE',values='EVENTS_COUNT',aggfunc=sum,fill_value=0)`. Check the link mentioned as dupe for detailed explanation with examples – anky Dec 15 '21 at 18:22
  • My bad, updated the expected output – nad Dec 15 '21 at 18:22

1 Answers1

2

Try with pivot_table:

output = (df.pivot_table("EVENTS_COUNT",["NODE_ID","DATE","HOUR"],"TYPE","sum")
            .fillna(0)
            .add_prefix("EVENTS_COUNT_")
            .reset_index()
            .rename_axis(None, axis=1))

>>> output

TYPE  NODE_ID        DATE  HOUR  EVENTS_COUNT_a  EVENTS_COUNT_b
0          10  2021-12-02     0            25.0             0.0
1          15  2021-12-03     2            21.0             0.0
2          20  2021-12-02     3             0.0            22.0

If there is only one TYPE for each combination of ["NODE_ID", "DATE", "HOUR"], and you would like this as a separate column, you can do:

output["TYPE"] = output.filter(like="EVENTS_COUNT_").idxmax(axis=1).str.lstrip("EVENTS_COUNT_")

>>> output
   NODE_ID        DATE  HOUR  EVENTS_COUNT_a  EVENTS_COUNT_b TYPE
0       10  2021-12-02     0            25.0             0.0    a
1       15  2021-12-03     2            21.0             0.0    a
2       20  2021-12-02     3             0.0            22.0    b
not_speshal
  • 22,093
  • 2
  • 15
  • 30