Fill in missing values in a data.table using the growth rate by category

Question

I have incomplete (time) series where I would like to fill up missing values using available recent values and growth rates from another series, by category (countries). Categories, missing values are not equal length. This requires applying a function on a variable sequentially: first I need to take the last available data point (which can be anywhere) and divide it by 1+growth rate, then move to the next data point and do the same.

Example dataset and desired outcome:

require(data.table)
DT_desired<-data.table(category=c(rep("A",4),rep("B",4)),
           year=2010:2013,
           grwth=c(NA,.05,0.1,0,NA,0.1,0.15,0.2))
DT_desired[,values:=c(cumprod(c(1,DT_desired[category=="A"&!is.na(grwth),grwth]+1)),cumprod(c(1,DT_desired[category=="B"&!is.na(grwth),grwth]+1)))]

DT_example <- copy(DT_desired)[c(1,2,3,5),values:=NA]

What I tried: you can do it by a for loop, but that is inefficient and discouraged in R. I came to like the efficiency of data.table, and I would preferably do it in that way. I have tried the shift function of data table, which only fills one missing value (which is logical as it tries to execute at the same time I guess, when the rest is missing the previous value).

DT_example[,values:=ifelse(is.na(values),shift(values,type = "lead")/(1+shift(grwth,type = "lead")),values),by=category]

I gather from other posts that you probably can do it with the rollapply function of the zoo package, but I just got the feeling that I should be able to do it in data table without yet another additional package, and that the solution is relatively simple and elegant, just I am not experienced enough to find it.

This may very well be a duplicate and sorry if I did not notice the appropriate post, but none of what I found did exactly what I want.

This is surprisingly well-formed question with an attempt at solving the problem. Good work! — Happypig375, Nov 27 '18 at 15:15
Thank you @Happypig375. I was a bit afraid that this will be shot down as something that was answered several times ;) — Zsopi, Nov 27 '18 at 15:42
`DT_example<-DT[c(1,2,3,5),values:=NA]` `Error: object 'DT' not found` I am baffled how people upvote without even running the code. It is a well-formed question, but you need to give us a reproducible example. Provide `DT` or give us the output of `dput(DT_example)`. Cheers! p.s. I upvoted too, but after running the code. — M--, Nov 27 '18 at 16:18
https://stackoverflow.com/a/29106679/6461462 look at this particular answer. you may be able to change the function, add another input to it to have the growth rate instead of classic interpolation technique. — M--, Nov 27 '18 at 16:37
I got the same thing @Masoud, I think the line should be: `DT_example <- copy(DT_desired)[c(1,2,3,5),values:=NA]` — zack, Nov 27 '18 at 17:11
Sorry about the typo in the DT_example, and @zack is right, it should read: DT_example <- copy(DT_desired)[c(1,2,3,5),values:=NA] — Zsopi, Nov 27 '18 at 18:24
your 1st paragraph and your sample & required datasets do not seem to tell the same story. for your sample & required datasets, you can do `DT_example[, values := cumprod(1 + replace(grwth, is.na(grwth), 0)), by=.(category)]` — chinsoon12, Nov 28 '18 at 01:31
@chinsoon that does not result in the desired outcome, only replaces values with cumulative growth indices. Maybe my first paragraph was not clear enough, but I want to keep available bits of the "values" variable, and estimate the rest by dividing the oldest available "values" with 1+growth rates. It would be possible to do the division with cumulative growth indices as well, but those should have a values of 1 at the position of the latest available "values", and I could not produce that, due to the varying length of the NAs. — Zsopi, Nov 28 '18 at 08:33

score 1 · Accepted Answer · answered Nov 29 '18 at 16:44

Not sure if this has been solved outside of SO, but it caught my eye the other day. I hadn't written Rcpp in a long time and figured this would be good practice. I know you were looking for a native data.table solution, so feel free to take it or leave it:

Contents of foo.cpp file:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fillValues(NumericVector vals, NumericVector gRates){

  int n = vals.size();
  NumericVector out(n);

  double currentValue   = vals[n - 1];
  double currentGrowth  = gRates[n - 1];

  // initial assignment
  out[n - 1] = currentValue;

  for(int i = n - 2; i >= 0; i--){

    if(NumericVector::is_na(vals[i])){
      // If val[i] is na, we need prior values to populate it
      if(!((currentValue || currentValue == 0) && (currentGrowth || currentGrowth == 0))){
        // We need a currentValue and currentGrowth to base growth rate on, throw error
        Rcpp::stop("NaN Values for rates or value when needed actual value");
      } else {
        // Update value
        out[i] = currentValue / (1 + currentGrowth);
      }
    } else {
      out[i] = vals[i];
    }

    // update
    currentValue = out[i];
    if(!NumericVector::is_na(gRates[i])){
      currentGrowth = gRates[i];
    }
  }

  return out;
}

/*** R
require(data.table)
DT_desired<-data.table(category=c(rep("A",4),rep("B",4)),
                       year=2010:2013,
                       grwth=c(NA,.05,0.1,0,NA,0.1,0.15,0.2))

DT_desired[,values:=c(cumprod(c(1,DT_desired[category=="A"&!is.na(grwth),grwth]+1)),cumprod(c(1,DT_desired[category=="B"&!is.na(grwth),grwth]+1)))]

DT_example <- copy(DT_desired)[c(1,2,3,5),values:=NA]

DT_desired[]
DT_example[]

DT_example[, values:= fillValues(values, grwth)][]
*/

Then run it:

> Rcpp::sourceCpp('foo.cpp')

# Removed output that created example data

> DT_desired[]
   category year grwth values
1:        A 2010    NA  1.000
2:        A 2011  0.05  1.050
3:        A 2012  0.10  1.155
4:        A 2013  0.00  1.155
5:        B 2010    NA  1.000
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

> DT_example[]
   category year grwth values
1:        A 2010    NA     NA
2:        A 2011  0.05     NA
3:        A 2012  0.10     NA
4:        A 2013  0.00  1.155
5:        B 2010    NA     NA
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

> DT_example[, values:= fillValues(values, grwth)][]
   category year grwth values
1:        A 2010    NA  1.000
2:        A 2011  0.05  1.050
3:        A 2012  0.10  1.155
4:        A 2013  0.00  1.155
5:        B 2010    NA  1.000
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

Note this runs back to front, so it assumes you want to begin with the most recent recording and work to recordings from further back. It also assumes your dataset is sorted.

thanks for this, I will accept it as an answer, it seems that there is no obvious solution using data.table. I will do it with a regular cycle though, I do not want to learn Rcpp at this point (maybe if the need repeatedly arises in the future), but I will read up on it a bit. Thanks again. — Zsopi, Dec 02 '18 at 15:25

Fill in missing values in a data.table using the growth rate by category

1 Answers1