0

I have the following DataFrame

VOTES    CITY
 24      A
 22      A
 20      B
 NaN     A
 NaN     A
 30      B
 NaN     C

I need to fill the NaN with mean of values where CITY is 'A' or 'C'

The following code I tried was only updating the first row in VOTES and rest allwere updated to NaN.

train['VOTES'][((train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))]=train['VOTES'].loc[((~train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))].astype(int).mean(axis=0)

The output of 'VOTES' after this all values are updated as 'NaN' except one record which is at index 0. The Mean is calculated correctly though .

DYZ
  • 55,249
  • 10
  • 64
  • 93
Alfaromeo
  • 379
  • 2
  • 6
  • 14

1 Answers1

1

Use Series.fillna only for filtered rows with mean of filtered rows:

train['VOTES_EN']=train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)').astype(float)
m= train['CITY'].isin(['A','C'])
mean = train.loc[m,'VOTES_EN'].mean() 
train.loc[m,'VOTES_EN']=train.loc[m,'VOTES_EN'].fillna(mean)
train['VOTES_EN'] = train['VOTES_EN'].astype(int)
print (train)
   VOTES CITY  VOTES_EN
0   24.0    A        24
1   22.0    A        22
2   20.0    B        20
3    NaN    A        23
4    NaN    A        23
5   30.0    B        30
6    NaN    C        23
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I'm getting 'cannot index multidimensional key' . Just to add 'VOTES' is a categorical column. Possible to create a new column? – Alfaromeo May 08 '19 at 06:56
  • Sorry , but still the same issue , the first line is to extract the numeric from the string variables , actually the column is having values i.e '24 votes' , hence extracted the same and assigned to new column . train['VOTES_EN']=train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)') m=train.loc[(train['CITY'].isin(['A','C']))] mean = train.loc[m,'VOTES_EN'].astype(int).mean() train.loc[m,'VOTES_EN']=train[m,'VOTES_EN'].fillna(mean) – Alfaromeo May 08 '19 at 07:27
  • @Alfaromeo - What is `print (train.info())` ? – jezrael May 08 '19 at 07:32
  • ` RangeIndex: 12690 entries, 0 to 12689 Data columns (total 10 columns): TITLE 12690 non-null category RESTAURANT_ID 12690 non-null int16 CUISINES 12690 non-null category TIME 12690 non-null category CITY 12578 non-null category LOCALITY 12592 non-null category RATING 12688 non-null category VOTES 11486 non-null category COST 12690 non-null int16 VOTES_EN 11486 non-null object dtypes: category(7), int16(2), object(1)` – Alfaromeo May 08 '19 at 07:35
  • @Alfaromeo - There is main problem in `m=train.loc[(train['CITY'].isin(['A','C']))]` - need `m= train['CITY'].isin(['A','C'])`. Last not 100% sure if need last `train['VOTES_EN'] = train['VOTES_EN'].astype(int)`, because `mean` should in another data return floats, not integers. – jezrael May 08 '19 at 08:01
  • `train['VOTES_EN']=train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)') m=train['CITY'].isin(['A','C]) train['VOTES_EN'] = pd.to_numeric(train['VOTES_EN']) mean = train.loc[m,'VOTES_EN'].mean() print (mean) train.loc[m,'VOTES_EN']=train[m,'VOTES_EN'].fillna(mean)` The last line now throws the error as "'Series' objects are mutable, thus they cannot be hashed" – Alfaromeo May 08 '19 at 08:43
  • @Alfaromeo You forget cast to float in first row `train['VOTES_EN']=train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)').astype(float)` – jezrael May 08 '19 at 08:44
  • @Alfaromeo - In last line is lost second `loc` - need `train.loc[m,'VOTES_EN']=train.loc[m,'VOTES_EN'].fillna(mean)` – jezrael May 08 '19 at 08:54
  • @Alfaromeo - Supeeeeeeeeeeeeer :) I am really happy :) – jezrael May 08 '19 at 09:24