Need to insert rows for missing dates for individuals in pandas dataframe

Question

I have a data set containing donor information for several years, and I need to insert rows where a donor has skipped a year. There are several thousand records in the actual dataframe, but a sample looks like this:

import pandas as pd
df = pd.DataFrame([['A','2011',10], ['A','2012',10],['A','2013',10],['B','2011',20], 
                   ['B','2013',20]],columns=['donor_id','year','donation'])
df

donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
4   B   2013    20

I need to insert a zero donation for donor B for 2012, so it should end up looking like this:

donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
4   B   2012    0
5   B   2013    20

I've tried several solutions to similar problems, but haven't been successful yet. This solution looks exactly like what I need, but I lose about half the rows of the dataframe and can't figure out why that's happening.

df = pd.read_csv(r'filepath')
df = df.drop_duplicates(subset=['donor_id','year'])
df['year_DT'] = pd.to_datetime(df['year'])

df = (df.set_index('year_DT').
      groupby('donor_id').
      apply(lambda x: x.asfreq(freq='Y')).
      drop('donor_id', axis=1))

df = df.reset_index()
df["Index"] = df.groupby('donor_id').cumcount()+1

Andrej Kesely · Answer 1 · 2022-07-02T14:09:47.317

You can group by the donor_id column with .groupby() and apply a custom function to each group.

In this function, merge the group with a new pd.Series built from range(<min year of this group>, <max year of this group> + 1), using a right join so that every year in that range gets a row.

The rows introduced by the merge contain NaNs, which are then filled in: donor_id is forward-filled and donation is set to 0:

def fn(x):
    out = x.merge(
        pd.Series(range(x["year"].min(), x["year"].max() + 1), name="year"),
        how="right",
    )
    out["donor_id"] = out["donor_id"].ffill()
    out["donation"] = out["donation"].fillna(0)
    return out


df["year"] = df["year"].astype(int)
df = df.groupby("donor_id").apply(fn).reset_index(drop=True)
print(df)

Prints:

  donor_id  year  donation
0        A  2011      10.0
1        A  2012      10.0
2        A  2013      10.0
3        B  2011      20.0
4        B  2012       0.0
5        B  2013      20.0
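Depending on the pandas version, .groupby().apply() may warn about, or no longer pass, the grouping column to the applied function (as far as I know, recent releases changed this behaviour). The same merge idea can be written with an explicit loop over the groups instead. This is only a sketch under that assumption: fill_years is a hypothetical helper name, and the cast back to int assumes donations are whole numbers.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "2011", 10], ["A", "2012", 10], ["A", "2013", 10],
     ["B", "2011", 20], ["B", "2013", 20]],
    columns=["donor_id", "year", "donation"],
)
df["year"] = df["year"].astype(int)


def fill_years(group, donor):
    """Return one row per year between the group's first and last year."""
    full = pd.DataFrame(
        {"year": range(group["year"].min(), group["year"].max() + 1)}
    )
    # Left join from the full year range: skipped years get NaN donations.
    out = full.merge(group, on="year", how="left")
    out["donor_id"] = donor
    # Assumption: donations are integers, so casting back to int is safe.
    out["donation"] = out["donation"].fillna(0).astype(int)
    return out[["donor_id", "year", "donation"]]


# Iterating over a groupby yields (key, sub-dataframe) pairs.
result = pd.concat(
    [fill_years(group, donor) for donor, group in df.groupby("donor_id")],
    ignore_index=True,
)
print(result)
```

Because the year range is computed per group, a donor is only filled between their own first and last donation year.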

gontxomde · Answer 2 · 2022-07-02T14:22:47.717

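I would try generating a new index for your dataframe, covering every combination of donor_id and year, and then applying it with reindex. Note that the years run from the overall minimum to the overall maximum, so every donor gets a row for every year in that range: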

df['year'] = df['year'].astype(int)

# Every year between the earliest and latest year in the whole dataframe
years = list(range(df['year'].min(), df['year'].max() + 1))
ids = list(df['donor_id'].unique())

new_index = pd.MultiIndex.from_product([ids, years], names=['donor_id', 'year'])

df_new = df.set_index(['donor_id', 'year'])
# reindex returns a new dataframe, so the result has to be assigned
df_new = df_new.reindex(new_index, fill_value=0)
df_new = df_new.reset_index()

# Output:
    donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
4   B   2012    0
5   B   2013    20
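If each donor should only be filled between their own first and last year, rather than across the overall range, the index can be built from per-donor ranges instead of from_product. A minimal sketch, assuming the (donor_id, year) pairs are unique (reindex needs that); spans and full_index are illustrative names only:

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "2011", 10], ["A", "2012", 10], ["A", "2013", 10],
     ["B", "2011", 20], ["B", "2013", 20]],
    columns=["donor_id", "year", "donation"],
)
df["year"] = df["year"].astype(int)

# First and last donation year for each donor.
spans = df.groupby("donor_id")["year"].agg(["min", "max"])

# One (donor_id, year) pair for every year inside each donor's own span.
full_index = pd.MultiIndex.from_tuples(
    [
        (donor, year)
        for donor, first, last in spans.itertuples()
        for year in range(first, last + 1)
    ],
    names=["donor_id", "year"],
)

filled = (
    df.set_index(["donor_id", "year"])
      .reindex(full_index, fill_value=0)  # skipped years get donation 0
      .reset_index()
)
print(filled)
```

On the sample data both variants give the same six rows, since donors A and B share the same first and last year. They differ only when donors start or stop in different years.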

MoRe · Answer 3 · 2022-07-02T22:34:12.187

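# Every year that appears anywhere in the data
unique = set(df.year.unique())
# For each donor, collect the years they are missing, one row per missing year
data = df.groupby("donor_id").agg({"year":lambda x: unique - set(x)}).explode("year").dropna().reset_index()
# Missing years count as a zero donation
data["donation"] = 0
pd.concat([df, data]).sort_values(["donor_id", "year"])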

output:

donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
0   B   2012    0
4   B   2013    20

score 1 · Answer 4 · answered Jul 02 '22 at 22:58

One option is with complete from pyjanitor, which abstracts the process for exposing missing rows:

# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers .complete() on DataFrame

df.complete('donor_id', 'year', fill_value=0)

  donor_id  year  donation
0        A  2011        10
1        A  2012        10
2        A  2013        10
3        B  2011        20
4        B  2012         0
5        B  2013        20
