pandas: groupby 1 column, sum another, and get rid of duplicate rows

Question

I'm sure this has been asked & answered before but I'm probably phrasing my question wrong.

I have the following DataFrame:

   article  day  views
0   729910   13    162
1   729910   14    283
2   730855   13      1
3   731449   13      2

I want a single row per value in article, with a views (or total_views) column that sums the views across all rows for that article.

So the output should be this (day doesn't matter for me here):

  article  views
0  729910  445 (162 + 283)
1  730855  1
2  731449  2

The closest I got is:

parsed_report_df.groupby(['article', 'day'])['views'].sum()

Which yields:

article  day
729910   13     162
         14     283
730855   13       1
731449   13       2
735682   12       1

but I need the views summed across all days for each article.

that prints `object`. I thought the conversion was made automatically..? — zerohedge, Oct 15 '18 at 12:37
@jezrael - this worked, thanks! can you add as an answer? would be nice to know how this specific group-by expression works. — zerohedge, Oct 15 '18 at 12:41

score 2 · Answer 1 · answered Oct 15 '18 at 13:06

Drop the extra column, then groupby, sum, and reset_index should get you the output:

>>> import pandas as pd
>>> df = pd.DataFrame(data=[[729910, 13, 162], [729910, 14, 283], [730855, 13, 1], [731449, 13, 2]], columns=['article', 'day', 'views'])

>>> df
   article  day  views
0   729910   13    162
1   729910   14    283
2   730855   13      1
3   731449   13      2

>>> df[['article','views']].groupby('article').sum().reset_index()

   article  views
0   729910    445
1   730855      1
2   731449      2
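The question also mentions a views/total_views column. As a minimal sketch (not part of the original answer), named aggregation in `agg` can do the sum and the rename in one step. The column name `total_views` is an assumption taken from the question's wording, and the frame below just rebuilds the sample data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    data=[[729910, 13, 162], [729910, 14, 283], [730855, 13, 1], [731449, 13, 2]],
    columns=["article", "day", "views"],
)

# Named aggregation: output column name = (input column, aggregation).
# "total_views" is an illustrative name, not something the answer uses.
totals = df.groupby("article", as_index=False).agg(total_views=("views", "sum"))
print(totals)
```

The day column is dropped simply because it is not named in the aggregation.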

score 1 · Accepted Answer · answered Oct 15 '18 at 12:51

You need to convert the column to floats or integers first with astype, and then aggregate with GroupBy.sum:

A solution working with a Series; the groupby key is also a Series, the article column:

df = (parsed_report_df['views'].astype(float)
                               .groupby(parsed_report_df['article']).sum()
                               .reset_index())
print(df)
  article  views
0  729910  445.0
1  730855    1.0
2  731449    2.0

Another solution, assigning the converted values back to the views column:

parsed_report_df['views'] = parsed_report_df['views'].astype(float)
df = parsed_report_df.groupby('article', as_index=False)['views'].sum()
print(df)
  article  views
0  729910  445.0
1  730855    1.0
2  731449    2.0
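To make the dtype point concrete, here is a small sketch under the assumption that views was parsed as strings, which would explain the `object` dtype mentioned in the comments. The frame is a hypothetical stand-in for parsed_report_df, and `pd.to_numeric` is swapped in for `astype(float)` so that whole numbers stay integers:

```python
import pandas as pd

# Hypothetical stand-in for parsed_report_df: views arrives as strings.
parsed_report_df = pd.DataFrame({
    "article": [729910, 729910, 730855, 731449],
    "day": [13, 14, 13, 13],
    "views": ["162", "283", "1", "2"],
})

# Before conversion the column is not numeric, so a sum would not add numbers.
numeric_before = pd.api.types.is_numeric_dtype(parsed_report_df["views"])

# Convert the strings to numbers, then aggregate per article.
parsed_report_df["views"] = pd.to_numeric(parsed_report_df["views"])
totals = parsed_report_df.groupby("article", as_index=False)["views"].sum()
print(totals)
```

If some values cannot be parsed, `pd.to_numeric` raises by default; passing `errors="coerce"` turns them into NaN instead.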

jezrael

Thanks. Accepted. Why do I need `as_index` or `reset_index` in these cases? – zerohedge Oct 15 '18 at 12:53
@zerohedge - `as_index=False` is used to return a `DataFrame`, converting the index values to columns. Funnily enough, it does not always work, while `.reset_index()` always works and does the same thing. – jezrael Oct 15 '18 at 12:54
but it also returns a DataFrame even if `as_index` is `True`? or does it return something else? – zerohedge Oct 15 '18 at 12:56
It returns a `Series` if `as_index=True`; that is the default value, so it can be omitted. – jezrael Oct 15 '18 at 12:57
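The difference discussed in these comments can be checked directly. This is a minimal sketch on the sample data from the question, with illustrative variable names, comparing the default, `as_index=False`, and `reset_index()`:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    data=[[729910, 13, 162], [729910, 14, 283], [730855, 13, 1], [731449, 13, 2]],
    columns=["article", "day", "views"],
)

# Default (as_index=True): a Series whose index is the group key.
as_series = df.groupby("article")["views"].sum()

# as_index=False: a DataFrame with the group key as an ordinary column.
as_frame = df.groupby("article", as_index=False)["views"].sum()

# reset_index() on the Series moves the key from the index into a column.
via_reset = df.groupby("article")["views"].sum().reset_index()

print(type(as_series).__name__, type(as_frame).__name__, type(via_reset).__name__)
```

For this simple case the last two results hold the same columns and values.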
