1

So basically I have a dataset somewhat like this dataset iamge

I want to select only 'Havard' rows and get their mean and replace it with missing values in GPA only for 'Havard' and different mean value for Stanford

I tried data['GPA'].fillna(data['GPA'].mean()) but it would just fill it with mean value of the whole column.

KevinJ
  • 31
  • 5
  • Does this answer your question? [How to fillna by groupby outputs in pandas?](https://stackoverflow.com/questions/41680089/how-to-fillna-by-groupby-outputs-in-pandas) – Mustafa Aydın Oct 01 '20 at 11:34

2 Answers2

1

Given a file (yourdata.csv) containing the data outlined in your question, you could use the below example to replace all null values in the GPA column with the mean GPA of the corresponding University for each row.

This solution makes use of df.apply (docs) to apply the same conditional logic to each row within the df.

import pandas as pd
import numpy as np

df = pd.read_csv('yourdata.csv')

df['GPA'].fillna(np.nan, inplace=True) # Ensuring null values in GPA col are np.nan

df['GPA'] = df.apply(
    lambda row: df[df['University'] == row['University']]['GPA'].mean() 
    if np.isnan(row['GPA']) else row['GPA'], axis=1
)

If you only want to perform this logic on rows where University == 'Harvard' (i.e. not fill null GPA values in Stanford, Oxford etc. rows with their respective mean) you could simply add this as a further condition within the lambda applied to each row to ensure that only rows where University == 'Harvard' are affected.

# Same imports, original df and fillna as above snippet
df['GPA'] = df.apply(
    lambda row: df[df['University'] == row['University']]['GPA'].mean() 
    if np.isnan(row['GPA']) and row['University'] == 'Harvard' 
    else row['gpa'], axis=1
)
JPI93
  • 1,507
  • 5
  • 10
0

See if this works

df['GPA']=df[df['University']=='Havard'].fillna(df.mean())
Subasri sridhar
  • 809
  • 5
  • 13