pandas dataframe groupby summation

Question

input:

       Date letters numbers mixed         new
0  1/2/2014       a       6    z1  1/2/2014 a
1  1/2/2014       a       3    z1  1/2/2014 a
2  1/3/2014       c       1    x3  1/3/2014 c

I want to groupby new and sum numbers so that the output is:

       Date letters numbers mixed         new
0  1/2/2014       a       9    z1  1/2/2014 a
1  1/3/2014       c       1    x3  1/3/2014 c

I've looked through here: http://pandas.pydata.org/pandas-docs/stable/groupby.html but no luck.

Here is my code:

import pandas
a=[['Date', 'letters', 'numbers', 'mixed'], ['1/2/2014', 'a', '6', 'z1'], ['1/2/2014', 'a', '3', 'z1'], ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:],columns=a[0])
f=[]
for i in range(0,len(df)):
    f.append(df['Date'][i] + ' ' + df['letters'][i])
df['new']=f

Also, any tricks that will concatenate date and letters without looping thru would also be helpful.

cwharland · Accepted Answer · 2014-05-23T03:30:44.667

1

You have to convert your numbers column to int

import pandas as pd
a=[['Date', 'letters', 'numbers', 'mixed'], ['1/2/2014', 'a', '6', 'z1'], ['1/2/2014', 'a', '3', 'z1'], ['1/3/2014', 'c', '1', 'x3']]
df = pd.DataFrame.from_records(a[1:],columns=a[0])
df['new'] = df.Date + " " + df.letters
df.numbers = df.numbers.astype(int)

print df

       Date letters  numbers mixed         new
0  1/2/2014       a        6    z1  1/2/2014 a
1  1/2/2014       a        3    z1  1/2/2014 a
2  1/3/2014       c        1    x3  1/3/2014 c

You can get the dataframe you want to merge with:

df_to_merge = df[df.columns[~df.columns.isin(['numbers'])]].drop_duplicates()

Then you can do your groupby

df_grouped = pd.DataFrame(df.groupby('new').numbers.sum()).reset_index()

To get the result you posted merge

df_result = df_to_merge.merge(df_grouped)
print df_result

       Date letters mixed         new  numbers
0  1/2/2014       a    z1  1/2/2014 a        9
1  1/3/2014       c    x3  1/3/2014 c        1

edited May 23 '14 at 03:30

answered May 23 '14 at 02:57

cwharland

6,275
3
22
29

Thanks for the pointers. Could you please modify your answer for a general solution? ie. what if I had 50 columns instead of the 4 columns in the example. That would much appreciated. – jason May 23 '14 at 03:09
btw, i tried your code and still getting the original input back – jason May 23 '14 at 03:17
You'll have to post your output because I get exactly your desired result. I'll update to make more general. Also...do you really need the column "new" or are you just using it to group by? – cwharland May 23 '14 at 03:21
Thanks for the update. I do need the column 'new', it will be part of the dataset going forward. – jason May 23 '14 at 03:50
One last question on your code. What if 'numbers' had a column name such as 'big numbers'. for the `df_grouped` statement, the `.numbers` would need to be changed. `.big numbers` does obviously does not work. – jason May 25 '14 at 06:22
1

@jason_cant_code you can reference the column with `df['big numbers']` the pandas documentation and intro has all of this syntax info in it – cwharland May 25 '14 at 15:06

pandas dataframe groupby summation

1 Answers1