Joining 2 data frames with all the same columns

Question

In pandas/jupyter notebook with python:

I have a dataframe (df1) with information about the amount of crime per year where each row is a summation of the total amount of crime in that country-year unit. However, df1 does not have rows which contain "0 crime" for the years with 0 crime, and I want to add them.

I thought the easiest way to do this would be to create a blank dataframe (df2) with the same columns, but containing every country-year. Then I could bring in the df1 data through a join/merge and change all the NaN values in df2 to 0 for the years with no crime.

df1 looks like this (in excel):

df2 looks like this:

So basically I want to put df1 into the format of df2 so that I have data with all the years with 0 crime as well. I'm new to coding and not really sure how to approach this because I'm not understanding the documentation for .join and .merge. There are 18 countries and years are 2000-2020. Let me know if you have any thoughts!

score 1 · Answer 1 · answered Mar 05 '23 at 06:58

First, I would build a list of tuples covering every multi-index combination that should exist:

# Move the multi-index levels into regular columns
middleware = df1.reset_index()

# Collect every year in the observed range and the unique (country, country_txt) pairs.
# The range spans the minimum and maximum year in the data; substitute your own bounds if needed.
years = list(range(middleware['iyear'].min(), middleware['iyear'].max() + 1))
countries = middleware[['country', 'country_txt']].drop_duplicates().values.tolist()

# Combine each country pair with every year
uniq = [(code, name, year) for code, name in countries for year in years]
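
The same combinations could also be produced with itertools.product. This is only a sketch on made-up sample data: the `nkill` column and the values are assumptions for illustration, and only the `country`, `country_txt` and `iyear` names come from the code above.

```python
from itertools import product

import pandas as pd

# Hypothetical sample standing in for df1.reset_index(); "nkill" is an assumed value column
middleware = pd.DataFrame({
    "country": [4, 4, 6],
    "country_txt": ["Afghanistan", "Afghanistan", "Albania"],
    "iyear": [2000, 2002, 2001],
    "nkill": [3, 5, 1],
})

years = range(middleware["iyear"].min(), middleware["iyear"].max() + 1)
countries = middleware[["country", "country_txt"]].drop_duplicates().values.tolist()

# Pair every (country, country_txt) with every year in the range
uniq = [(code, name, year) for (code, name), year in product(countries, years)]
```

With two countries and three years this gives six tuples, including the country-years that have no row in the sample.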

Then check whether a row with each index already exists and, if not, add one filled with zeros:

for idx in uniq:
    # Add a zero-filled row only when this country-year is missing
    if idx not in df1.index:
        df1.loc[idx] = [0] * len(df1.columns)

The new rows are appended at the end. To put them back in order, group by the index levels and assign the result back:

df1 = df1.groupby(['country', 'country_txt', 'iyear']).sum()
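
As a possible shortcut for the loop and the regrouping, the full index can be passed to reindex with fill_value=0. This is a minimal sketch, assuming df1 has a unique multi-index with the levels `country`, `country_txt` and `iyear`. The `nkill` column and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same index levels as df1; "nkill" is an assumed value column
df1 = pd.DataFrame(
    {"nkill": [3, 5, 1]},
    index=pd.MultiIndex.from_tuples(
        [(4, "Afghanistan", 2000), (4, "Afghanistan", 2002), (6, "Albania", 2001)],
        names=["country", "country_txt", "iyear"],
    ),
)

# Unique (country, country_txt) pairs and the years to cover (use range(2000, 2021) for 2000-2020)
countries = df1.index.droplevel("iyear").unique()
years = range(2000, 2003)

# Every country pair combined with every year
full_index = pd.MultiIndex.from_tuples(
    [(code, name, year) for code, name in countries for year in years],
    names=df1.index.names,
)

# Existing rows keep their values; missing country-years are created with 0
filled = df1.reindex(full_index, fill_value=0)
```

The result follows the order of full_index, so no extra sorting step should be needed. Note that reindex requires the existing index to have no duplicate entries.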

If you want the result in the shape of df2 (without the multi-index), reset the index as in the example above. After all the steps described, add: df1.reset_index(inplace=True) - Vladislav Sokolov, Mar 05 '23 at 07:00
There may be a more concise solution. So far, this is the only option I have thought of - Vladislav Sokolov, Mar 05 '23 at 07:57
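
For the merge-based idea described in the question, a rough sketch might look like this. It assumes df1 is a flat frame (no multi-index) and uses a hypothetical `nkill` column with made-up values. df2 here is a skeleton of every country-year built with MultiIndex.from_product.

```python
import pandas as pd

# Hypothetical flat df1; "nkill" is an assumed value column
df1 = pd.DataFrame({
    "country_txt": ["Afghanistan", "Afghanistan", "Albania"],
    "iyear": [2000, 2002, 2001],
    "nkill": [3, 5, 1],
})

# df2: one row per country-year, with no data columns yet
df2 = pd.MultiIndex.from_product(
    [df1["country_txt"].unique(), range(2000, 2003)],
    names=["country_txt", "iyear"],
).to_frame(index=False)

# A left merge keeps every row of df2; country-years absent from df1 become NaN, then 0
merged = df2.merge(df1, on=["country_txt", "iyear"], how="left").fillna(0)
```

Because the missing rows pass through NaN, the value column comes out as floats. It can be cast back with astype(int) if integer counts are wanted.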
