Replacing missing values with mean

Question

I am exploring pandas library, and I'd find this dataset. My task is to fill ? with mean of by group of column 'num-of-doors'. When I used dataframe.groupby('num-of-doors').mean() pandas was unable to find mean of these columns:

'peak-rpm', 'price', 'bore', 'stroke', 'normalized-losses', 'horsepower'

So, I tried with my own dataset to know why it is not working. I created a file with the following contents

c0,c1,type
1,2,0
2,3,0
2,4,0
1,?,1
1,3,1

and I wrote the following script:

data = pd.read_csv("data.csv")
data = data.replace('?',np.nan)
print(data)
print(data.groupby('type').mean())

this is what I'm getting as output:

   c0   c1  type
0   1    2     0
1   2    3     0
2   2    4     0
3   1  NaN     1
4   1    3     1
            c0
type          
0     1.666667
1     1.000000

Can you please explain what is going on here? Why I'm not getting mean for column c1? Even I tried some Stackoverflow's answers, but still got nothing. Any suggestions?

Really appreciate your help.

Dani Mesejo · Accepted Answer · 2020-12-29T10:57:39.193

3

The problem is that c1, is not of type numeric, do:

data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
print(data.groupby('type').mean())

Output

            c0   c1
type               
0     1.666667  3.0
1     1.000000  3.0

When you read the original data DataFrame, as it has a ? the column is of dtype object (using dtypes to verify):

c0       int64
c1      object
type     int64
dtype: object

If you want to replace the nan, with the mean of the group use transform + fillna:

data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)

res = data.groupby('type').transform('mean')
print(data.fillna(res))

Output

   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  3.0     1
4   1  3.0     1

As a last advise you could read the csv as:

data = pd.read_csv("data.csv", na_values='?')
print(data)

Output

   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  NaN     1
4   1  3.0     1

This will save you the need of converting the columns to numeric.

edited Dec 29 '20 at 10:57

answered Dec 29 '20 at 10:30

Dani Mesejo

61,499
6
49
76

1

@HarshilModi Updated the answer showing how to fill na with the mean – Dani Mesejo Dec 29 '20 at 10:45
Would this still works for the dataset I found on github? After I replace missing values of `num-of-doors` with 'two'? – Harshil Modi Dec 29 '20 at 10:53
I'm not sure I understand. What do you mean 'two'? – Dani Mesejo Dec 29 '20 at 10:54
1

@HarshilModi I updated the answer with another tip – Dani Mesejo Dec 29 '20 at 10:57
It's a value for that column. That column can have value either 'four' or 'two'. – Harshil Modi Dec 29 '20 at 11:25
I guess there is no need to type cast (you casted column `c1` as float) data type for [this dataset](https://github.com/nyuvis/datasets/blob/master/auto/imports-85.data). Pandas does that for us. After I change `read_csv('data.csv')` to `read_csv('data.csv', na_values='?')` I was able to get mean for all remaining fields. – Harshil Modi Dec 29 '20 at 12:42
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226558/discussion-between-dani-mesejo-and-harshil-modi). – Dani Mesejo Dec 29 '20 at 12:45

score 2 · Answer 2 · answered Dec 29 '20 at 10:37

2

df['c1']=df['c1'].str.replace('[?]','NaN').astype(float)
df.groupby('type').apply(lambda x: x.fillna(x.mean()))

answered Dec 29 '20 at 10:37

wwnde

26,119
6
18
32

Replacing missing values with mean

2 Answers2