Fastest way for filling-in missing dates for data.table (cont.)

Question

I am searching for an efficient and fast approach to fill missing data in a table with missing dates.

library(data.table)
dt <- as.data.table(read.csv(textConnection('"date","gr1","gr2","x"
                                            "2017-01-01","A","a",1
                                            "2017-02-01","A","b",2
                                            "2017-02-01","B","a",4
                                            "2017-04-01","B","a",5
                                            "2017-05-01","A","b",3')))
dt[,date := as.Date(date)]

Suppose that this table has all the information for x by date and groups gr1 and gr2. I want to fill the missing dates and expand this table by repeating the last known values of x by gr1 and gr2. My approach is as follows:

# define the period to expand
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
dates <- setDT(list(ddate = seq.Date(date_min, date_max,by = 'month')))

# cast the data
dt.c <- dcast(dt, date~gr1+gr2, value.var = "x")
# fill missing dates
dt.c <- dt.c[dates, roll=Inf]

# melt the data to return to original table format
dt.m <- melt(dt.c, id.vars = "date", value.name = "x")

# split column - the slowest part of my code
dt.m[,c("gr1","gr2") := tstrsplit(variable,'_')][,variable:=NULL]

# remove unnecessary NAs
dt.m <- dt.m[complete.cases(dt.m[,x])][,.(date,gr1,gr2,x)]
setkey(dt.m)

This is the output that I expect to see:

> dt.m
         date gr1 gr2 x
1: 2017-01-01   A   a 1
2: 2017-02-01   A   b 2
3: 2017-02-01   B   a 4
4: 2017-03-01   A   b 2
5: 2017-03-01   B   a 4
6: 2017-04-01   B   a 5
7: 2017-05-01   A   b 3
8: 2017-06-01   A   b 3

Now the problem is that tstrsplit is very slow on large data sets with a lot of groups.
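
One possible way to reduce the cost of that step, offered as a sketch rather than something from the original post: melt() returns variable as a factor by default, so tstrsplit can be run once on the factor levels (one string per group) and the pieces mapped back to rows through the integer codes. The names lv, parts and idx are illustrative, and the sketch assumes that no group name itself contains "_".

```r
library(data.table)

# Same sample data as in the question, built directly.
dt <- data.table(
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-02-01",
                   "2017-04-01", "2017-05-01")),
  gr1  = c("A", "A", "B", "B", "A"),
  gr2  = c("a", "b", "a", "a", "b"),
  x    = c(1, 2, 4, 5, 3)
)

dt.c <- dcast(dt, date ~ gr1 + gr2, value.var = "x")
dt.m <- melt(dt.c, id.vars = "date", value.name = "x")

# Split only the unique labels (one per group), not every row.
lv    <- levels(dt.m$variable)
parts <- tstrsplit(lv, "_", fixed = TRUE)

# Map the split pieces back to the rows via the factor's integer codes.
idx <- as.integer(dt.m$variable)
dt.m[, c("gr1", "gr2") := lapply(parts, function(p) p[idx])][, variable := NULL]
```

Whether this is actually faster depends on how many rows share each label, so it would need to be timed on the real data.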

This approach is very close to what I need but if I follow it I could not get the desired output as it fills not only the missing dates but the NAs as well. This is my modification of the example:

# the desired dates by group
date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
indx <- dt[,.(date=seq(date_min,date_max,"months")),.(gr1,gr2)]

# key the tables and join them using a rolling join
setkey(dt,gr1,gr2,date)
setkey(indx,gr1,gr2,date)
dt0 <- dt[indx,roll=TRUE][,.(date,gr1,gr2,x)]
setkey(dt0,date)

And this is not the output that I expect to see:

> dt0
          date gr1 gr2  x
 1: 2017-01-01   A   a  1
 2: 2017-01-01   A   b NA
 3: 2017-01-01   B   a NA
 4: 2017-02-01   A   a  1
 5: 2017-02-01   A   b  2
 6: 2017-02-01   B   a  4
 7: 2017-03-01   A   a  1
 8: 2017-03-01   A   b  2
 9: 2017-03-01   B   a  4
10: 2017-04-01   A   a  1
11: 2017-04-01   A   b  2
12: 2017-04-01   B   a  5
13: 2017-05-01   A   a  1
14: 2017-05-01   A   b  3
15: 2017-05-01   B   a  5
16: 2017-06-01   A   a  1
17: 2017-06-01   A   b  3
18: 2017-06-01   B   a  5

What is the best (fastest) way to reproduce my output above (dt.m)?

score 4 · Accepted Answer · answered Mar 06 '19 at 18:38

One rolling join, one 'normal' join and some column switching, aaaand you're done :)

temp <- dates[, near.date := dt[dates, x.date, on = .(date=ddate), roll = TRUE, mult = "first"]][]
dt[temp, on = .(date = near.date)][, date := ddate][,ddate := NULL][]

#          date gr1 gr2 x
# 1: 2017-01-01   A   a 1
# 2: 2017-02-01   A   b 2
# 3: 2017-02-01   B   a 4
# 4: 2017-03-01   A   b 2
# 5: 2017-03-01   B   a 4
# 6: 2017-04-01   B   a 5
# 7: 2017-05-01   A   b 3
# 8: 2017-06-01   A   b 3

You can (of course) make it a one-liner by integrating the first line into the last.
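
A sketch of what that one-liner could look like, assuming the same dt and dates objects as in the question (rebuilt here so the block runs on its own). Note that := still adds near.date to dates by reference as a side effect.

```r
library(data.table)

dt <- data.table(
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-02-01",
                   "2017-04-01", "2017-05-01")),
  gr1  = c("A", "A", "B", "B", "A"),
  gr2  = c("a", "b", "a", "a", "b"),
  x    = c(1, 2, 4, 5, 3)
)
dates <- data.table(
  ddate = seq(as.Date("2017-01-01"), as.Date("2017-06-01"), by = "month")
)

# Inner call: for every ddate, roll back to the nearest date present in dt.
# Outer call: ordinary join on that date, then swap ddate in as the date.
res <- dt[
  dates[, near.date := dt[dates, x.date, on = .(date = ddate),
                          roll = TRUE, mult = "first"]],
  on = .(date = near.date)
][, date := ddate][, ddate := NULL][]
```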

Thanks. It seems to be the fastest way to solve the problem. — Svilen, Mar 07 '19 at 13:00

Frank · Answer 2 · 2019-03-07T13:21:25.023

I'd use IDate and an integer counter for the sequence of dates:

dt[, date := as.IDate(date)]
dates = seq(as.IDate("2017-01-01"), as.IDate("2017-06-01"), by="month")
dDT = data.table(date = dates)[, dseq := .I][]

dt[dDT, on=.(date), dseq := i.dseq]

Then enumerate all desired combos (gr1, gr2, dseq) and do a couple update joins:

cDT = CJ(dseq = dDT$dseq, gr1 = unique(dt$gr1), gr2 = unique(dt$gr2))

cDT[, x := dt[cDT, on=.(gr1, gr2, dseq), x.x]]
cDT[is.na(x), x := dt[copy(.SD), on=.(gr1, gr2, dseq), roll=1L, x.x]]

res = cDT[!is.na(x)]
res[dDT, on=.(dseq), date := i.date]

    dseq gr1 gr2 x       date
 1:    1   A   a 1 2017-01-01
 2:    2   A   a 1 2017-02-01
 3:    2   A   b 2 2017-02-01
 4:    2   B   a 4 2017-02-01
 5:    3   A   b 2 2017-03-01
 6:    3   B   a 4 2017-03-01
 7:    4   B   a 5 2017-04-01
 8:    5   A   b 3 2017-05-01
 9:    5   B   a 5 2017-05-01
10:    6   A   b 3 2017-06-01

There are two extra rows here compared with what the OP expected:

res[!dt.m, on=.(date, gr1, gr2)]

   dseq gr1 gr2 x       date
1:    2   A   a 1 2017-02-01
2:    5   B   a 5 2017-05-01

since I am treating each missing gr1 x gr2 value independently, rather than filling it iff the date is not in dt at all (as in the OP). To apply that rule...

drop_rows = res[!dt, on=.(gr1,gr2,date)][date %in% dt$date, .(gr1,gr2,date)]
res[!drop_rows, on=names(drop_rows)]

(The copy(.SD) is needed because of a likely bug.)

Thanks. As you mention, your answer has extra rows, so it does not solve my problem. Could you correct it so it does not include the extra rows? Please benchmark it against @Wimpel's solution. — Svilen, Mar 07 '19 at 13:06
@Svilen Okay, I've corrected it. I am pretty sure Wimpel's approach is more efficient, and the example is not scalable to a large size (usually, you'd want to make an example that scales as a function of `n` or something in the OP), so I don't have a benchmark. — Frank, Mar 07 '19 at 13:23
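
As a rough illustration of the kind of scalable example that comment asks for, the sketch below generates gappy monthly data whose size grows with the number of groups. The function make_data and its parameters (n_groups, n_months, keep, seed) are hypothetical names chosen for this sketch, not anything from the thread.

```r
library(data.table)

# Hypothetical generator: n_groups values of gr1, two values of gr2,
# n_months monthly dates, of which only a fraction `keep` of rows is retained.
make_data <- function(n_groups, n_months = 24L, keep = 0.5, seed = 1L) {
  set.seed(seed)
  months <- seq(as.Date("2017-01-01"), by = "month", length.out = n_months)
  full <- CJ(date = months,
             gr1  = sprintf("G%05d", seq_len(n_groups)),
             gr2  = c("a", "b"))
  full[, x := seq_len(.N)]
  # Drop rows at random to create gaps, then restore date order.
  kept <- full[sample(nrow(full), floor(nrow(full) * keep))]
  kept[order(date, gr1, gr2)]
}

small <- make_data(10L)     # 240 rows
# big <- make_data(1e5L)    # larger input for timing the answers
```

Each answer could then be wrapped in a function and timed on make_data(n) for increasing n, for example with system.time().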

chinsoon12 · Answer 3 · 2019-03-07T01:32:57.010

dt should contain a row with x = NA for every unique date in each combination of gr1 and gr2, but those rows are absent. Hence, we use CJ and a join to fill those missing dates with NA for x.

After that, expand the dataset for all required ddates.

Finally, filter away rows where x is NA and order by date to make output have the same characteristics as the original dt.

dt[, g := .GRP, .(gr1, gr2)][
    CJ(date=date, g=g, unique=TRUE), on=.(date, g)][, 
        .SD[.(date=ddate), on=.(date), roll=Inf], .(g)][
            !is.na(x)][order(date)]

output:

   g       date gr1 gr2 x
1: 1 2017-01-01   A   a 1
2: 2 2017-02-01   A   b 2
3: 3 2017-02-01   B   a 4
4: 2 2017-03-01   A   b 2
5: 3 2017-03-01   B   a 4
6: 3 2017-04-01   B   a 5
7: 2 2017-05-01   A   b 3
8: 2 2017-06-01   A   b 3
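
For readers who find the chained call hard to follow, here is the same idea unrolled into the three steps described above. This is a sketch with illustrative intermediate names (grid, padded, rolled, out); the sample data is rebuilt so the block is self-contained.

```r
library(data.table)

dt <- data.table(
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-02-01",
                   "2017-04-01", "2017-05-01")),
  gr1  = c("A", "A", "B", "B", "A"),
  gr2  = c("a", "b", "a", "a", "b"),
  x    = c(1, 2, 4, 5, 3)
)
ddate <- seq(as.Date("2017-01-01"), as.Date("2017-06-01"), by = "month")

# One integer id per (gr1, gr2) combination.
dt[, g := .GRP, by = .(gr1, gr2)]

# Step 1: every (existing date, group) pair; pairs absent from dt get x = NA.
grid   <- CJ(date = dt$date, g = dt$g, unique = TRUE)
padded <- dt[grid, on = .(date, g)]

# Step 2: per group, roll the last row forward onto every requested date.
# The explicit NA rows stop a value from rolling past a date on which
# other groups were observed but this one was not.
rolled <- padded[, .SD[.(date = ddate), on = .(date), roll = Inf], by = g]

# Step 3: drop the NA rows and restore date order.
out <- rolled[!is.na(x)][order(date)]
```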

data:

library(data.table)
dt <- fread('date,gr1,gr2,x
    2017-01-01,A,a,1
    2017-02-01,A,b,2
    2017-02-01,B,a,4
    2017-04-01,B,a,5
    2017-05-01,A,b,3')
dt[,date := as.Date(date)] 

date_min <- as.Date('2017-01-01')
date_max <- as.Date('2017-06-01')
ddate = seq.Date(date_min, date_max,by = 'month')

Please try on your actual dataset.

Thanks. The solution is ok but it is much slower compared to @Wimpel approach on my dataset. — Svilen, Mar 07 '19 at 13:02

Soren · Answer 4 · 2019-03-06T17:41:34.690

This is a bit similar to another question, although not precisely a duplicate. The approach is similar, but with data.tables and with multiple columns. See also: Fill in missing date and fill with the data above

Here, it's unclear whether you're seeking to fill in columns gr2 and x, or what role gr2 plays. I'm assuming you want to fill gaps between dates in 1-month increments. Also, the input data's latest month is 5 (May) while the desired output runs through 6 (June), so it's unclear how June is reached if the goal is to fill in between input dates. If there's an external maximum, it can be used instead of the max of the input dates.

library(data.table)
library(tidyr)
dt <- as.data.table(read.csv(textConnection('"date","gr1","gr2","x"
                                            "2017-01-01","A","a",1
                                            "2017-02-01","A","b",2
                                            "2017-02-01","B","a",4
                                            "2017-04-01","B","a",5
                                            "2017-05-01","A","b",3')))
dt[,date := as.Date(date)] 
setkeyv(dt,"date")

all_date_groups <- dt[,list(date=seq.Date(from=min(.SD$date),to=max(.SD$date),by="1 month")),by="gr1"]
setkeyv(all_date_groups,"date")

all_dates_dt <- dt[all_date_groups,on=c("date","gr1")]
setorderv(all_dates_dt,c("gr1","date"))

all_dates_dt <- fill(all_dates_dt,c("gr2","x"))
setorderv(all_dates_dt,c("date","gr1"))
all_dates_dt

Results:

> all_dates_dt
         date gr1 gr2 x
1: 2017-01-01   A   a 1
2: 2017-02-01   A   b 2
3: 2017-02-01   B   a 4
4: 2017-03-01   A   b 2
5: 2017-03-01   B   a 4
6: 2017-04-01   A   b 2
7: 2017-04-01   B   a 5
8: 2017-05-01   A   b 3

Thanks. I am sorry for not being clear in my question. Actually, I use `gr1` and `gr2` as properties of every `x`, so in the end I want to repeat the value of every `x` from the previous date, regardless of whether the missing date is a gap between dates in the dataset or a future date. Also, I don't want to fill a missing date with an `x` that does not exist on the previous date. In your case the solution does not fill future dates correctly. If you decide to correct it, please benchmark it against @Wimpel's solution. — Svilen, Mar 07 '19 at 13:20
