
Using pandas in a Jupyter notebook (Python):

I have a dataframe (df1) with information about the amount of crime per year, where each row is the total amount of crime for that country-year unit. However, df1 has no rows for country-years with 0 crime, and I want to add them.

I thought the easiest way to do this would be to create a blank df (df2) with the same columns, but with all the country-years in it. Then I could bring the df1 data into df2 through join/merge and change all the NaN values in df2 to 0 for years with no crime.

df1 looks like this (in Excel): [image]

df2 looks like this: [image]

So basically I want to put df1 into the format of df2 so that I have data with all the years with 0 crime as well. I'm new to coding and not really sure how to approach this because I'm not understanding the documentation for .join and .merge. There are 18 countries and years are 2000-2020. Let me know if you have any thoughts!

taraamcl

1 Answer


First, I would build a list of tuples covering every multi-index key that should exist:

# Move the multi-index levels back into regular columns
middleware = df1.reset_index()

# Collect every year in range and every unique (country, country_txt) pair.
# The range here runs from the earliest to the latest year in the data; substitute your own bounds if needed.
years = list(range(middleware['iyear'].min(), middleware['iyear'].max() + 1))
countries = middleware[['country', 'country_txt']].drop_duplicates().values.tolist()

# Combine them into one (country, country_txt, year) tuple per country-year
uniq = [(country, country_txt, year) for country, country_txt in countries for year in years]

Then check whether df1 already has a row for each of those keys and, if not, add one filled with zeros:

for key in uniq:
    try:
        df1.loc[key, :]
    except KeyError:
        # No row for this (country, country_txt, year) yet: add one filled with zeros
        df1.loc[key] = [0] * len(df1.columns)

The new rows are appended at the end of df1. To bring them back into order, group by the index levels (the result comes back sorted by those keys):

df1.groupby(['country', 'country_txt', 'iyear']).sum()
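As a possible alternative to the loop, the same zero-filling can be sketched without row-by-row inserts, either with reindex or with the join/merge idea from the question. This is only a sketch under assumptions: the sample data and the value column name n_crimes are made up for illustration, df1 is assumed to be indexed by (country, country_txt, iyear) with one row per country-year, and how="cross" needs pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical sample standing in for df1; 'n_crimes' is an assumed column name.
df1 = pd.DataFrame(
    {
        "country": [4, 4, 6],
        "country_txt": ["Afghanistan", "Afghanistan", "Albania"],
        "iyear": [2000, 2002, 2001],
        "n_crimes": [3, 5, 2],
    }
).set_index(["country", "country_txt", "iyear"])

# Shortened range for the sample; the question's data would use range(2000, 2021).
years = range(2000, 2003)

# Unique (country, country_txt) pairs, kept together so codes stay matched to names.
flat = df1.reset_index()
pairs = flat[["country", "country_txt"]].drop_duplicates()

# Option 1: reindex against the full country-year grid; missing rows become 0.
full_index = pd.MultiIndex.from_tuples(
    [(code, name, year) for code, name in pairs.itertuples(index=False) for year in years],
    names=df1.index.names,
)
filled = df1.reindex(full_index, fill_value=0)

# Option 2: build the "blank df2" grid with a cross join, then left-merge df1 into it.
grid = pairs.merge(pd.DataFrame({"iyear": list(years)}), how="cross")
merged = (
    grid.merge(flat, on=["country", "country_txt", "iyear"], how="left")
    .fillna(0)  # country-years absent from df1 come through as NaN
    .set_index(["country", "country_txt", "iyear"])
)

print(filled)
```

Both options should give one row per country-year. The merge route turns the value column into floats because of the intermediate NaN values, so an astype(int) may be wanted afterwards.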