3

I have a dataframe df:

   Serial_no       date  Index     x    y
           1 2014-01-01      1   2.0  3.0
           1 2014-03-01      2   3.0  3.0
           1 2014-04-01      3   6.0  2.0
           2 2011-03-01      1   5.1  1.3
           2 2011-04-01      2   5.8  0.6
           2 2011-05-01      3   6.5 -0.1
           2 2011-07-01      4   3.0  5.0
           3 2019-10-01      1   7.9 -1.5
           3 2019-11-01      2   8.6 -2.2
           3 2020-01-01      3  10.0 -3.6
           3 2020-02-01      4  10.7 -4.3
           3 2020-03-01      5   4.0  3.0

Notice: The data is grouped by Serial_no and the date is data reported monthly (first of every month). The Index column is set so each consecutive reported date is a consecutive number in the series. The number of reported dates in each group Serial_no are different. The interval of reported dates date are different for each group Serial_no (they don't start or end on the same date for each group).

The problem: There is no reported data for some dates date in the time series. Notice some dates are missing in each Serial_no group. I want to add a row in each group for those missing dates date and have the data reported in x and y columns as 'NaN'.

Example of the dataframe I need:

   Serial_no       date  Index       x       y
           1 2014-01-01      1     2.0     3.0
           1 2014-02-01      2     NaN     NaN
           1 2014-03-01      3     3.0     3.0
           1 2014-04-01      4     6.0     2.0
           2 2011-03-01      1     5.1     1.3
           2 2011-04-01      2     5.8     0.6
           2 2011-05-01      3     6.5    -0.1
           2 2011-06-01      4     NaN     NaN
           2 2011-07-01      5     3.0     5.0
           3 2019-10-01      1     7.9    -1.5
           3 2019-11-01      2     8.6    -2.2
           3 2019-12-01      3     NaN     NaN
           3 2020-01-01      4    10.0    -3.6
           3 2020-02-01      5    10.7    -4.3
           3 2020-03-01      6     4.0     3.0

I know how to replace the blank cells with NaN once the rows with missing dates are inserted, using the following code:

import pandas as pd
import numpy as np

df['x'].replace('', np.nan, inplace=True)
df['y'].replace('', np.nan, inplace=True)

I also know how to reset the index once the rows with missing dates are inserted, using the following code:

df["Index"] = df.groupby("Serial_no",).cumcount('date')

However, I'm unsure how to locate the the missing dates in each group and insert the row for those (monthly reported) dates. Any help is appreciated.

cowboykevin05
  • 85
  • 2
  • 8

2 Answers2

9

Use custom function with DataFrame.asfreq in GroupBy.apply and then reassign Index by GroupBy.cumcount:

df['date'] = pd.to_datetime(df['date'])

df = (df.set_index('date')
        .groupby('Serial_no')
        .apply(lambda x: x.asfreq('MS'))
        .drop('Serial_no', axis=1))
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
print (df)
    Serial_no       date  Index     x    y
0           1 2014-01-01      1   2.0  3.0
1           1 2014-02-01      2   NaN  NaN
2           1 2014-03-01      3   3.0  3.0
3           1 2014-04-01      4   6.0  2.0
4           2 2011-03-01      1   5.1  1.3
5           2 2011-04-01      2   5.8  0.6
6           2 2011-05-01      3   6.5 -0.1
7           2 2011-06-01      4   NaN  NaN
8           2 2011-07-01      5   3.0  5.0
9           3 2019-10-01      1   7.9 -1.5
10          3 2019-11-01      2   8.6 -2.2
11          3 2019-12-01      3   NaN  NaN
12          3 2020-01-01      4  10.0 -3.6
13          3 2020-02-01      5  10.7 -4.3
14          3 2020-03-01      6   4.0  3.0

Alternative solution with DataFrame.reindex:

df['date'] = pd.to_datetime(df['date'])

f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='MS', name='date'))
df = df.set_index('date').groupby('Serial_no').apply(f).drop('Serial_no', axis=1)
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi @jezrael, what if I want to add same range of dates in each group, how would I do that? – Roy May 09 '21 at 04:21
  • @jezrael How about if we want to populate `x`,`y` for missing months with values from previous month, instead of `NaN`? – cph_sto Mar 02 '22 at 09:56
  • 1
    @cph_sto - do you think change `x.asfreq('MS')` to `x.asfreq('MS', method='ffill')` ? – jezrael Mar 02 '22 at 09:58
  • @jezrael Thank you so much for your prompt help, as always :) – cph_sto Mar 02 '22 at 10:13
  • @jezrael On a bigger df, I am getting this error `ValueError: index must be monotonic increasing or decreasing` around `x.asfreq('MS', method='ffill')`. Do you know why this is happening? – cph_sto Mar 02 '22 at 13:14
  • 1
    @cph_sto - It seems data are not sorted, try `df = df.sort_values(['Serial_no','date'])` before solution – jezrael Mar 02 '22 at 13:15
  • @cph_sto - unfortunately no idea :( Need hard part - debug and found group which failed, then looking for dataetimes of group if no some duplicates or what should raise error. – jezrael Mar 02 '22 at 13:20
0

One option is with complete from pyjanitor, which abstracts the process for exposing missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

# create a mapping that is applied across each Serial_no group
new_dates = {'date':lamba d: pd.date_range(d.min(), d.max(), freq='MS')}

(df
.complete(new_dates, by='Serial_no')
.assign(Index = lambda df: df.groupby('Serial_no')
                             .Index
                             .cumcount()
                             .add(1))
)
    Serial_no       date  Index     x    y
0           1 2014-01-01      1   2.0  3.0
1           1 2014-02-01      2   NaN  NaN
2           1 2014-03-01      3   3.0  3.0
3           1 2014-04-01      4   6.0  2.0
4           2 2011-03-01      1   5.1  1.3
5           2 2011-04-01      2   5.8  0.6
6           2 2011-05-01      3   6.5 -0.1
7           2 2011-06-01      4   NaN  NaN
8           2 2011-07-01      5   3.0  5.0
9           3 2019-10-01      1   7.9 -1.5
10          3 2019-11-01      2   8.6 -2.2
11          3 2019-12-01      3   NaN  NaN
12          3 2020-01-01      4  10.0 -3.6
13          3 2020-02-01      5  10.7 -4.3
14          3 2020-03-01      6   4.0  3.0
sammywemmy
  • 27,093
  • 4
  • 17
  • 31