2

I am exploring pandas library, and I'd find this dataset. My task is to fill ? with mean of by group of column 'num-of-doors'. When I used dataframe.groupby('num-of-doors').mean() pandas was unable to find mean of these columns:

'peak-rpm', 'price', 'bore', 'stroke', 'normalized-losses', 'horsepower'

So, I tried with my own dataset to know why it is not working. I created a file with the following contents

c0,c1,type
1,2,0
2,3,0
2,4,0
1,?,1
1,3,1

and I wrote the following script:

data = pd.read_csv("data.csv")
data = data.replace('?',np.nan)
print(data)
print(data.groupby('type').mean())

this is what I'm getting as output:

   c0   c1  type
0   1    2     0
1   2    3     0
2   2    4     0
3   1  NaN     1
4   1    3     1
            c0
type          
0     1.666667
1     1.000000

Can you please explain what is going on here? Why I'm not getting mean for column c1? Even I tried some Stackoverflow's answers, but still got nothing. Any suggestions?

Really appreciate your help.

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
Harshil Modi
  • 397
  • 3
  • 8
  • 15

2 Answers2

3

The problem is that c1, is not of type numeric, do:

data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
print(data.groupby('type').mean())

Output

            c0   c1
type               
0     1.666667  3.0
1     1.000000  3.0

When you read the original data DataFrame, as it has a ? the column is of dtype object (using dtypes to verify):

c0       int64
c1      object
type     int64
dtype: object

If you want to replace the nan, with the mean of the group use transform + fillna:

data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)

res = data.groupby('type').transform('mean')
print(data.fillna(res))

Output

   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  3.0     1
4   1  3.0     1

As a last advise you could read the csv as:

data = pd.read_csv("data.csv", na_values='?')
print(data)

Output

   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  NaN     1
4   1  3.0     1

This will save you the need of converting the columns to numeric.

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • 1
    @HarshilModi Updated the answer showing how to fill na with the mean – Dani Mesejo Dec 29 '20 at 10:45
  • Would this still works for the dataset I found on github? After I replace missing values of `num-of-doors` with 'two'? – Harshil Modi Dec 29 '20 at 10:53
  • I'm not sure I understand. What do you mean 'two'? – Dani Mesejo Dec 29 '20 at 10:54
  • 1
    @HarshilModi I updated the answer with another tip – Dani Mesejo Dec 29 '20 at 10:57
  • It's a value for that column. That column can have value either 'four' or 'two'. – Harshil Modi Dec 29 '20 at 11:25
  • I guess there is no need to type cast (you casted column `c1` as float) data type for [this dataset](https://github.com/nyuvis/datasets/blob/master/auto/imports-85.data). Pandas does that for us. After I change `read_csv('data.csv')` to `read_csv('data.csv', na_values='?')` I was able to get mean for all remaining fields. – Harshil Modi Dec 29 '20 at 12:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226558/discussion-between-dani-mesejo-and-harshil-modi). – Dani Mesejo Dec 29 '20 at 12:45
2
df['c1']=df['c1'].str.replace('[?]','NaN').astype(float)
df.groupby('type').apply(lambda x: x.fillna(x.mean()))
wwnde
  • 26,119
  • 6
  • 18
  • 32