
I'm doing a simple group by operation, trying to compare group means. As you can see below, I have selected specific columns from a larger dataframe, from which all missing values have been removed.

[screenshot: selected columns and df head]

But when I group by, I am losing a couple of columns:

[screenshot: group-by logic and resulting df]

I have never encountered this with pandas, and I'm not finding anything similar on Stack Overflow. Does anybody have any insight?
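A minimal sketch that reproduces the same symptom (column names are invented here, not my real ones):

import pandas as pd

# hypothetical data: 'metric_b' looks numeric but is stored as strings (object dtype)
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'metric_a': [1.0, 2.0, 3.0, 4.0],
                   'metric_b': ['1.5', '2.5', '3.5', '4.5']})

# on the pandas versions current at the time, the non-numeric column
# silently disappears from the result
print(df.groupby('category').mean())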


3 Answers


I think this is the automatic exclusion of 'nuisance' columns, as described here.

Sample:

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [-0.919854, -0.042379, 1.247642, -0.009920, 0.290213, 0.495767, 0.362949, 1.548106],
                   'D': [-1.131345, -0.089329, 0.337863, -0.945867, -0.932132, 1.956030, 0.017587, -0.016692]})
print(df)
     A      B         C         D
0  foo    one -0.919854 -1.131345
1  bar    one -0.042379 -0.089329
2  foo    two  1.247642  0.337863
3  bar  three -0.009920 -0.945867
4  foo    two  0.290213 -0.932132
5  bar    two  0.495767  1.956030
6  foo    one  0.362949  0.017587
7  foo  three  1.548106 -0.016692

print(df.groupby('A').mean())
            C         D
A                      
bar  0.147823  0.306945
foo  0.505811 -0.344944

I think you can check DataFrame.dtypes to see which columns are non-numeric.
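For example, with the sample frame above:

print(df.dtypes)
# A     object
# B     object
# C    float64
# D    float64
# dtype: object

# only the numeric columns are aggregated by .mean();
# the object column B is dropped as a 'nuisance' column
print(df.select_dtypes(include='number').columns.tolist())  # ['C', 'D']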

jezrael
  • @jezrael Is there a way to stop this, so the old headers are still up top? – spacedustpi Jul 27 '18 at 13:50
  • @spacedustpi - What about `df.groupby('A', as_index=False).mean()` or `df.groupby('A').mean().reset_index()`? – jezrael Jul 27 '18 at 13:52
  • This is my code: `df1 = df.groupby(['ORGNTR_NM', 'ORGNTR_BNK_NM', 'BNFCRY_BNK_NM', 'BNFCRY_NM'], as_index=False)['TRNSXN_AMT'].agg(['sum', 'count'])`. The result is that these four original columns are no longer columns, while 'sum' and 'count' are, and they sit above the old columns. I would like them all to be headers. – spacedustpi Jul 27 '18 at 14:00
  • @spacedustpi - Does `df1 = df.groupby(['ORGNTR_NM', 'ORGNTR_BNK_NM', 'BNFCRY_BNK_NM', 'BNFCRY_NM'], as_index=False)['TRNSXN_AMT'].agg(['sum', 'count']).reset_index()` work? – jezrael Jul 27 '18 at 14:01
  • Please see original comment, I edited it. Sorry about that. – spacedustpi Jul 27 '18 at 14:03
  • That did it! Thanks. – spacedustpi Jul 27 '18 at 14:10
  • I'm finding this issue when I run `df.groupby(mycols, as_index=False).agg(lambda x: ";".join(pd.unique(x)))`. My "disappearing" column has a lot of `NA`, but otherwise strings. I find it strange that this behavior occurs here, but I've also been unlucky in getting a decent working example – Sos Jul 26 '19 at 06:45
  • Also, I want the `NA` to be dropped, just not the rows that also have values – Sos Jul 26 '19 at 06:50
  • That would've been easier indeed. I ended up doing `df.mycols.fillna('')`, grouping, and then `df.problem_col.replace('^;|;$','')` – Sos Jul 26 '19 at 08:41
  • Aagh. I had a stray non-numeric character deep in a CSV file, so Pandas "naturally" decided to silently fail to compute `.sum()` of that column and instead just deleted the column. I'm astonished that this is the default behaviour. – Tunneller May 08 '22 at 23:44
  • Pandas will not perform numerical groupby operations on `object` columns. You can check the column datatype with [`DataFrame.dtypes`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html). When pulling data from a SQL database, sometimes the null values come across as `None`, so you will need to replace the `None` values with `np.nan` using `df.fillna(np.nan)` or equivalent. – Foggy Jul 28 '22 at 16:50
  • For anyone running into this: after reading floats from a csv, pandas interpreted one column as object rather than float. I checked with df.dtypes and converted this one column to numeric `df.somecol = pd.to_numeric(df.somecol)` after that everything worked as expected. No other solution here worked for me. – squarespiral Nov 26 '22 at 12:45
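For reference, a rough sketch of the workaround Sos describes in the comments above (column names are hypothetical; the NaNs are filled first so the string column is not dropped, and the stray separators are stripped afterwards):

import pandas as pd

# hypothetical data: 'problem_col' holds strings with some missing values
df = pd.DataFrame({'key': ['x', 'x', 'y', 'y'],
                   'problem_col': ['a', None, 'b', 'c']})

# fill the NaNs so the column keeps its values, then join the unique strings per group
out = (df.fillna({'problem_col': ''})
         .groupby('key', as_index=False)
         .agg(lambda x: ';'.join(pd.unique(x))))

# strip the leading/trailing ';' left behind by the empty strings
out['problem_col'] = out['problem_col'].str.replace('^;|;$', '', regex=True)
print(out)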

Try df.groupby(['col_1', 'col_2'], as_index=False).mean(). Use as_index=False to keep the grouping columns as regular columns instead of moving them into the index (the default is True). The comments above have already answered this question, but I'm posting it as an answer.
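For example, with the sample frame from the answer above (newer pandas versions may also need numeric_only=True here):

# default: the group keys become the index
print(df.groupby('A').mean())
#             C         D
# A
# bar  0.147823  0.306945
# foo  0.505811 -0.344944

# as_index=False: the group keys stay as an ordinary column
print(df.groupby('A', as_index=False).mean())
#      A         C         D
# 0  bar  0.147823  0.306945
# 1  foo  0.505811 -0.344944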

Digvijay Sawant

Make sure your column has a numeric dtype (int or float) and not 'O' (object). This was the reason the column was disappearing for me.

You can check the dtype of the column with the code below:

df.column.dtypes
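If the column turns out to be object but actually holds numbers, a small sketch of one possible fix (column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': ['1.5', '2.5', '3.0']})
print(df['val'].dtype)  # object, so .mean() would exclude it

# convert to a numeric dtype; unparseable entries become NaN
df['val'] = pd.to_numeric(df['val'], errors='coerce')
print(df.groupby('grp')['val'].mean())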
Sami Navesi