42

I have a DataFrame with an index called city_id of cities in the format [city],[state] (e.g., new york,ny containing integer counts in the columns. The problem is that I have multiple rows for the same city, and I want to collapse the rows sharing a city_id by adding their column values. I looked at groupby() but it wasn't immediately obvious how to apply it to this problem.

Edit:

An example: I'd like to change this:

city_id    val1 val2 val3
houston,tx    1    2    0
houston,tx    0    0    1
houston,tx    2    1    1

into this:

city_id    val1 val2 val3
houston,tx    3    3    2

if there are ~10-20k rows.

Brad Larson
  • 170,088
  • 45
  • 397
  • 571
lightlike
  • 937
  • 1
  • 8
  • 12
  • 10
    This isn't really a duplicate and @DSM's answer `df.groupby(df.index)` doesn't appear in the duplicated question. – LondonRob Jun 09 '15 at 17:13

2 Answers2

54

Starting from

>>> df
              val1  val2  val3
city_id                       
houston,tx       1     2     0
houston,tx       0     0     1
houston,tx       2     1     1
somewhere,ew     4     3     7

I might do

>>> df.groupby(df.index).sum()
              val1  val2  val3
city_id                       
houston,tx       3     3     2
somewhere,ew     4     3     7

or

>>> df.reset_index().groupby("city_id").sum()
              val1  val2  val3
city_id                       
houston,tx       3     3     2
somewhere,ew     4     3     7

The first approach passes the index values (in this case, the city_id values) to groupby and tells it to use those as the group keys, and the second resets the index and then selects the city_id column. See this section of the docs for more examples. Note that there are lots of other methods in the DataFrameGroupBy objects, too:

>>> df.groupby(df.index)
<pandas.core.groupby.DataFrameGroupBy object at 0x1045a1790>
>>> df.groupby(df.index).max()
              val1  val2  val3
city_id                       
houston,tx       2     2     1
somewhere,ew     4     3     7
>>> df.groupby(df.index).mean()
              val1  val2      val3
city_id                           
houston,tx       1     1  0.666667
somewhere,ew     4     3  7.000000
DSM
  • 342,061
  • 65
  • 592
  • 494
  • Is there anyway to select the row which has the maximum mean in its group? In this example, it should select `houston,tx 2 1 1` from the first group. – Morteza Milani Sep 26 '17 at 19:56
  • Is there a method where you might define a column to sum, another to max, another to min so the result would be Huston, TX 3 (sum), 2 (max), 1(min)? – MrKingsley May 27 '22 at 17:29
5

Something in the same line. Sorry not the exact replica.

mydata = [{'subid' : 'B14-111', 'age': 75, 'fdg':1.78},
          {'subid' : 'B14-112', 'age': 22, 'fdg':1.56},{'subid' : 'B14-112', 'age': 40, 'fdg':2.00},]
df = pandas.DataFrame(mydata)

gg = df.groupby("subid",sort=True).sum()
LonelySoul
  • 1,212
  • 5
  • 18
  • 45