
I have a data set about femicides in Brazil. The columns are state, type_of_crime, year, quantity, deaths_100K_pop. There are some missing values in quantity, and I want to fill them with the mean of the quantity column, computed separately for each year. I don't know exactly how to do it or which way is more efficient. I would like some help, but not the entire solution. Thanks.

I thought about using groupby on year to find the average for each year and then filling the missing values. I thought about for loops as well.

1 Answer


For this task you should manage your data with a DataFrame from the pandas library. With a DataFrame, there are a few ways you could go about it.

Loop

You can use the groupby() method with a for loop to replace the missing values. Your code would be something like:

"""
Sketch only: the file name below is a placeholder, adjust it to your data
"""

import pandas as pd

# load the data (hypothetical file name)
df = pd.read_csv('femicides.csv')

# group data by year
grouped = df.groupby('year')

# find the mean for each year using the built-in mean method
means = grouped['quantity'].mean()

# Fill in missing values
for year, mean in means.items():
    m = (df['year'] == year) & (df['quantity'].isna())
    df.loc[m, 'quantity'] = mean
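
To see what the loop does step by step, here is a minimal sketch on a tiny made-up frame. The numbers are invented purely for illustration and only the year and quantity columns are included; the real data set has more columns.

```python
import numpy as np
import pandas as pd

# Made-up numbers, only to illustrate the mechanics
df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016],
    "quantity": [10.0, np.nan, 20.0, 4.0, np.nan],
})

# Series indexed by year; mean() skips NaN by default
means = df.groupby("year")["quantity"].mean()

# For each year, fill only the rows of that year that are missing
for year, mean in means.items():
    missing_in_year = (df["year"] == year) & df["quantity"].isna()
    df.loc[missing_in_year, "quantity"] = mean

print(df)
```

The missing 2015 value becomes 15.0 (the mean of 10 and 20), and the missing 2016 value becomes 4.0, since that is the only observed value for that year.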

One Liner

You can also solve this in a one-liner using fillna, groupby, transform, and mean.

df['quantity'] = df['quantity'].fillna(df.groupby('year')['quantity'].transform('mean'))

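This will most likely be faster than the loop, because transform('mean') returns a Series aligned with the original rows (each row gets the mean of its year), so fillna can fill every gap in one vectorized step instead of iterating in Python.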
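
One case worth checking with either approach: if a year has no observed quantity at all, its mean is NaN and the gap stays unfilled. The sketch below shows this on made-up numbers. The fallback to the overall mean is only an assumption about what you might want, not part of the original task.

```python
import numpy as np
import pandas as pd

# Made-up numbers; 2017 has no observed quantity at all
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017],
    "quantity": [10.0, np.nan, 4.0, 6.0, np.nan],
})

# Per-row mean of that row's year, aligned with df's index
year_mean = df.groupby("year")["quantity"].transform("mean")
filled = df["quantity"].fillna(year_mean)

# The 2017 mean is NaN, so that row is still missing
still_missing = int(filled.isna().sum())
print(still_missing)

# Optional fallback (an assumption): use the overall mean for what is left
filled_with_fallback = filled.fillna(df["quantity"].mean())
print(filled_with_fallback)
```

Whether a global fallback is appropriate depends on the analysis; leaving those rows as NaN or dropping them are equally reasonable choices.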

alien_jedi