I have a data set containing donor information for several years, and I need to insert rows where a donor has skipped a year. There are several thousand records in the actual dataframe, but a sample looks like this
import pandas as pd
df = pd.DataFrame([['A','2011',10], ['A','2012',10],['A','2013',10],['B','2011',20],
['B','2013',20]],columns=['donor_id','year','donation'])
df
donor_id year donation
0 A 2011 10
1 A 2012 10
2 A 2013 10
3 B 2011 20
4 B 2013 20
I need to insert a zero donation for donor B for 2012, so it should end up looking like this
donor_id year donation
0 A 2011 10
1 A 2012 10
2 A 2013 10
3 B 2011 20
4 B 2012 0
5 B 2013 20
I've tried several solutions to similar problems, but haven't been successful yet. This solution looks exactly like what I need, but I lose about half the rows of the dataframe and can't figure out why that's happening.
df = pd.read_csv(r'filepath')
df = df.drop_duplicates(subset=['donor_id','year'])
df['year_DT'] = pd.to_datetime(df['year'])
df = (df.set_index('year_DT').
groupby('donor_id').
apply(lambda x: x.asfreq(freq='Y')).
drop('donor_id', axis=1))
df = df.reset_index()
df["Index"] = df.groupby('donor_id').cumcount()+1