I'm working with a dataset which has monthly information about several users. And each user has a different time range. There is also missing data for each user. What I would like to do is fill in the missing data for each user based on the time range for each user(from min.time to max.time in months).
Since, the data pattern falls into a time series, linear interpolation would not make sense. I had also set a multiindex for the data frame based on 'user' and 'date', but 'time' based interpolation doesn't work(as it's not yet been implemented)
x = pd.DataFrame({'user': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','c','c','c','c','c','c','c','c'],'dt':['2015-01-01','2015-02-01','2015-03-01','2015-04-01','2015-05-01','2015-06-01','2015-07-01','2015-08-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01','2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01','2017-06-01','2017-07-01','2017-08-01'], 'val': [1,33,np.nan,1,np.nan,4,2,np.nan,66,2,5,1,np.nan,np.nan,7,5,np.nan,1,np.nan,7,4,np.nan,5,3,np.nan]})
user dt val
0 a 2015-01-01 1
1 a 2015-02-01 33
2 a 2015-03-01 NaN
3 a 2015-04-01 1
4 a 2015-05-01 NaN
5 a 2015-06-01 4
6 a 2015-07-01 2
7 a 2015-08-01 NaN
8 b 2016-01-01 66
9 b 2016-02-01 2
10 b 2016-03-01 5
11 b 2016-04-01 1
12 b 2016-05-01 NaN
13 b 2016-06-01 NaN
14 b 2016-07-01 7
15 b 2016-08-01 5
16 b 2016-09-01 NaN
17 c 2017-01-01 1
18 c 2017-02-01 NaN
19 c 2017-03-01 7
20 c 2017-04-01 4
21 c 2017-05-01 NaN
22 c 2017-06-01 5
23 c 2017-07-01 3
24 c 2017-08-01 NaN
In the above dataset, 'val' column has several missing values and I'm figuring out how to automate the process of doing this for several 'users'.
Also, will building a time series model for each user make sense given the no. of data points for each user?
Any input/workaround would be much appreciated.
Thanks, Luc.