
I'm working with a dataset that has monthly information about several users, and each user covers a different time range. There is also missing data for each user. What I would like to do is fill in the missing data for each user based on that user's own time range (from their minimum to their maximum date, in months).

Since the data follows a time series pattern, linear interpolation would not make sense. I had also set a MultiIndex on the DataFrame based on 'user' and 'date', but 'time'-based interpolation doesn't work there (it hasn't been implemented for a MultiIndex yet).

import numpy as np
import pandas as pd

x = pd.DataFrame({
    'user': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','c','c','c','c','c','c','c','c'],
    'dt': ['2015-01-01','2015-02-01','2015-03-01','2015-04-01','2015-05-01','2015-06-01','2015-07-01','2015-08-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01','2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01','2017-06-01','2017-07-01','2017-08-01'],
    'val': [1, 33, np.nan, 1, np.nan, 4, 2, np.nan, 66, 2, 5, 1, np.nan, np.nan, 7, 5, np.nan, 1, np.nan, 7, 4, np.nan, 5, 3, np.nan]
})
print(x)

user         dt     val
0     a 2015-01-01    1
1     a 2015-02-01   33
2     a 2015-03-01  NaN
3     a 2015-04-01    1
4     a 2015-05-01  NaN
5     a 2015-06-01    4
6     a 2015-07-01    2
7     a 2015-08-01  NaN
8     b 2016-01-01   66
9     b 2016-02-01    2
10    b 2016-03-01    5
11    b 2016-04-01    1
12    b 2016-05-01  NaN
13    b 2016-06-01  NaN
14    b 2016-07-01    7
15    b 2016-08-01    5
16    b 2016-09-01  NaN
17    c 2017-01-01    1
18    c 2017-02-01  NaN
19    c 2017-03-01    7
20    c 2017-04-01    4
21    c 2017-05-01  NaN
22    c 2017-06-01    5
23    c 2017-07-01    3
24    c 2017-08-01  NaN

In the above dataset, the 'val' column has several missing values, and I'm trying to figure out how to automate the process of filling them in for each of the users.
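
For reference, this is roughly the MultiIndex attempt I mentioned above (just a sketch of what I tried; as far as I can tell, method='time' needs a plain DatetimeIndex, so it fails on a MultiIndex):

# Sketch of the attempt: set a ('user', 'dt') MultiIndex and interpolate by time.
x['dt'] = pd.to_datetime(x['dt'])
y = x.set_index(['user', 'dt'])

# This raises an error, because time-weighted interpolation expects
# a DatetimeIndex, not a MultiIndex.
y['val'] = y['val'].interpolate(method='time')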

Also, would building a time series model for each user make sense given the number of data points per user?

Any input/workaround would be much appreciated.

Thanks, Luc.

lucifer
  • Can you show your expected output? – ansev Oct 08 '19 at 10:38
  • `x.groupby('user')['val'].apply(lambda x: x.ffill())`? – ansev Oct 08 '19 at 10:39
  • please, check my answer – ansev Oct 08 '19 at 11:02
  • @ansev I had set a MultiIndex on the DataFrame on 'user' and 'date' and interpolate(method = 'time') was failing. Since it's time series data, would linear interpolation or mean interpolation make sense? – lucifer Oct 08 '19 at 11:07
  • I think linear interpolation makes more sense, because that way each value depends on the closest values, and no spikes occur if the user's activity changes sharply. But it depends on the application you have in mind. – ansev Oct 08 '19 at 11:10
  • Since I have very few data points for each group, I think this should suffice. But I've been reading that 'trend' and 'seasonality' have to be taken into account when interpolating time series. Unfortunately, I haven't seen any cases that deal with multiple time series in a single dataframe. – lucifer Oct 08 '19 at 11:14
  • You could average both – ansev Oct 08 '19 at 11:15
  • `x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna((x.interpolate() + x.mean())/2))` – ansev Oct 08 '19 at 11:16
  • You could even use weights other than 0.5, and adjust them depending on the standard deviation (std). – ansev Oct 08 '19 at 11:17
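
A minimal sketch of that weighted blend (the 0.5 weight, and the idea of tuning it per group, are assumptions to adapt rather than a fixed recipe):

# Blend linear interpolation with the per-user mean using an adjustable weight w:
# w = 1.0 -> pure interpolation, w = 0.0 -> pure group mean.
w = 0.5  # example weight, tune as needed

x['val'] = x.groupby('user')['val'].apply(
    lambda s: s.fillna(w * s.interpolate() + (1 - w) * s.mean())
)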

2 Answers


You could use groupby + apply to fill in the missing values depending on the user, without needing to create a separate series for each user.

Here is an example of how you could fill in this missing data (in addition to these methods, you can also create your own function):

1. Interpolating within each group with interpolate, filling with fillna:

x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna(x.interpolate()))
print(x)

   user          dt   val
0     a  2015-01-01   1.0
1     a  2015-02-01  33.0
2     a  2015-03-01  17.0
3     a  2015-04-01   1.0
4     a  2015-05-01   2.5
5     a  2015-06-01   4.0
6     a  2015-07-01   2.0
7     a  2015-08-01   2.0
8     b  2016-01-01  66.0
9     b  2016-02-01   2.0
10    b  2016-03-01   5.0
11    b  2016-04-01   1.0
12    b  2016-05-01   3.0
13    b  2016-06-01   5.0
14    b  2016-07-01   7.0
15    b  2016-08-01   5.0
16    b  2016-09-01   5.0
17    c  2017-01-01   1.0
18    c  2017-02-01   4.0
19    c  2017-03-01   7.0
20    c  2017-04-01   4.0
21    c  2017-05-01   4.5
22    c  2017-06-01   5.0
23    c  2017-07-01   3.0
24    c  2017-08-01   3.0

2. Forward-filling with ffill:

x['val']=x.groupby('user')['val'].apply(lambda x: x.ffill())

   user          dt   val
0     a  2015-01-01   1.0
1     a  2015-02-01  33.0
2     a  2015-03-01  33.0
3     a  2015-04-01   1.0
4     a  2015-05-01   1.0
5     a  2015-06-01   4.0
6     a  2015-07-01   2.0
7     a  2015-08-01   2.0
8     b  2016-01-01  66.0
9     b  2016-02-01   2.0
10    b  2016-03-01   5.0
11    b  2016-04-01   1.0
12    b  2016-05-01   1.0
13    b  2016-06-01   1.0
14    b  2016-07-01   7.0
15    b  2016-08-01   5.0
16    b  2016-09-01   5.0
17    c  2017-01-01   1.0
18    c  2017-02-01   1.0
19    c  2017-03-01   7.0
20    c  2017-04-01   4.0
21    c  2017-05-01   4.0
22    c  2017-06-01   5.0
23    c  2017-07-01   3.0
24    c  2017-08-01   3.0

3. Using the mean of each group, filling with fillna:

x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna(x.mean()))
print(x)

   user          dt        val
0     a  2015-01-01   1.000000
1     a  2015-02-01  33.000000
2     a  2015-03-01   8.200000
3     a  2015-04-01   1.000000
4     a  2015-05-01   8.200000
5     a  2015-06-01   4.000000
6     a  2015-07-01   2.000000
7     a  2015-08-01   8.200000
8     b  2016-01-01  66.000000
9     b  2016-02-01   2.000000
10    b  2016-03-01   5.000000
11    b  2016-04-01   1.000000
12    b  2016-05-01  14.333333
13    b  2016-06-01  14.333333
14    b  2016-07-01   7.000000
15    b  2016-08-01   5.000000
16    b  2016-09-01  14.333333
17    c  2017-01-01   1.000000
18    c  2017-02-01   4.000000
19    c  2017-03-01   7.000000
20    c  2017-04-01   4.000000
21    c  2017-05-01   4.000000
22    c  2017-06-01   5.000000
23    c  2017-07-01   3.000000
24    c  2017-08-01   4.000000
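
If you prefer the interpolation to weight by the actual dates rather than by row position, one possible variant (a sketch, assuming dt is converted to datetime first) is to give each group a DatetimeIndex before interpolating:

# Time-weighted interpolation per user: index each group by its dates
# so that method='time' has a DatetimeIndex to work with.
x['dt'] = pd.to_datetime(x['dt'])
x['val'] = (x.set_index('dt')
             .groupby('user')['val']
             .transform(lambda s: s.interpolate(method='time'))
             .values)
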
ansev

That's how I would do it.

The assumption is that you know how you want to infer the missing values.

Then I would create the function used to impute the data; it should look something like this:

def f(x):
    # x is a DataFrame with columns [datetime, value]
    # ...
    # ...
    output = ...
    # output is a dataframe with columns [datetime, value],
    # where value is a column with the data imputed (so without nans)
    return output

then you can just do:

x.groupby('user').apply(f)
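
For example, a concrete f for this question's columns could reindex each user to every month between their own minimum and maximum date and then interpolate (just one possible imputation, assuming dt is a datetime column; the helper below is an illustration, not the only choice):

def f(g):
    # g holds one user's rows with columns ['user', 'dt', 'val']
    g = g.set_index('dt')

    # build that user's full monthly range, from their min to their max date
    full_range = pd.date_range(g.index.min(), g.index.max(), freq='MS')
    g = g.reindex(full_range)

    g['user'] = g['user'].ffill()                    # label any newly added months
    g['val'] = g['val'].interpolate(method='time')   # impute the values
    return g.rename_axis('dt').reset_index()[['user', 'dt', 'val']]

x['dt'] = pd.to_datetime(x['dt'])
out = x.groupby('user', group_keys=False).apply(f).reset_index(drop=True)
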
Giacomo Sachs