Hi I am working on a dataset where there is a host_id and two other columns : reviews_per_month and number_of_reviews. For every host_id, majority of the values are present for these two columns whereas some of them are zeros. For each column, I want to replace those 0 values by the mean of all the values related with that host_id. Here is the code I have tried :
def process_rpm_nor(data):
data['reviews_per_month'] = data['reviews_per_month'].fillna(0)
data['number_of_reviews'] = data['number_of_reviews'].fillna(0)
data_list = []
for host_id in set(data['host_id']):
data_temp = data[data['host_id'] == host_id]
nor_non_zero = np.mean(data_temp[data_temp['number_of_reviews'] > 0]['number_of_reviews'])
rpm_non_zero = np.mean(data_temp[data_temp['reviews_per_month'] > 0]['reviews_per_month'])
data_temp['number_of_reviews'] = data_temp['number_of_reviews'].replace(0,nor_non_zero)
data_temp['reviews_per_month'] = data_temp['reviews_per_month'].replace(0,rpm_non_zero)
data_list.append(data_temp)
return pd.concat(data_list, axis = 1)
Though the code works, yet it takes a lot of time to process and I was wondering if anyone could help by offering an alternate solution to this problem or help me optimize my code. I'd really appreciate the help.