How to efficiently extrapolate missing data for multiple variables

Question

I have panel data and numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate for missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA values in the "National GDP" and "National Life Expectancy" variables using a linear trend calculated with the observed data points in each column.

###Simulate National GDP values
set.seed(42)
nat_gdp <- c(replicate(20L, {
  foo <- rnorm(3, mean = 2000, sd = 300) + c(0,1000,2000) 
  c(NA,NA,foo)}))
###Simulate national life expectancy values

nat_life <- c(replicate(20L, {
  foo <-  rnorm(2, mean = 55, sd = 7.8) + c(0,1.5)
  c(NA,NA,NA,foo)}))




###Construct the data.table       
data.sim <- data.table(  GovernorateID = c(rep(seq.int(11L,15L,by=1L), each = 20)), 
                         DistrictID =rep(seq.int(1100,1500,by=100),each=20 ) + rep(seq_len(4), each = 5), 
                         Year = seq.int(1990,1994,by=1L),
                         National_gdp =  nat_gdp   , 
                         National_life_exp =    nat_life  )

Have you looked into multiple imputation? See 'mice' package, or other related packages. Also, this doesn't seem like a programming question as much as a statistics question. — ndoogan, Mar 25 '13 at 00:22
Never grow vectors within loops, preallocate or use `replicate`. I've edited to show a simpler and more efficient approach — mnel, Mar 25 '13 at 00:32
@ndoogan For the years with missingness, I'm missing observations across most variables, so multiple imputation is not an option. That's why I'm looking for simple linear extrapolation. — Dr. Beeblebrox, Mar 25 '13 at 00:34
Because every time you reassign, you copy the whole thing. It is in the [2nd circle of R hell](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) — mnel, Mar 25 '13 at 00:38
I had never seen `L` before. In case anyone else wants to know, I found this helpful SO discussion: http://stackoverflow.com/questions/22191324/clarification-of-l-in-r — Dr. Beeblebrox, Mar 18 '14 at 15:24

Matthew Lundberg · Accepted Answer · 2013-03-25T02:52:34.590

I assume that you want to do the linear model on each DistrictID separately.

Original data table:

head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990           NA                NA
## 2:            11       1101 1991           NA                NA
## 3:            11       1101 1992     1988.746                NA
## 4:            11       1101 1993     2527.619          54.70739
## 5:            11       1101 1994     3854.210          44.21809
## 6:            11       1102 1990           NA                NA

dd <- copy(data.sim) # Make a copy for later.

Replace NA elements in each with the prediction of a linear model. Two steps in one chained operation.

data.sim[, National_life_exp := ifelse(is.na(National_life_exp), 
                                       predict(lm(National_life_exp ~ Year, data=.SD), .SD),
                                       National_life_exp)
         , by=DistrictID
         ][, National_gdp := ifelse(is.na(National_gdp),
                                    predict(lm(National_gdp ~ Year, data=.SD), .SD),
                                    National_gdp) 
           , by=DistrictID
        ]


head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990    -8.004377          86.17531
## 2:            11       1101 1991   924.727559          75.68601
## 3:            11       1101 1992  1988.745871          65.19670
## 4:            11       1101 1993  2527.618676          54.70739
## 5:            11       1101 1994  3854.209743          44.21809
## 6:            11       1102 1990  1008.886661          70.45643

A nice (?) plot. Note that each level of DistrictID has exactly two non-NA points in this example.

plot(data.sim$National_life_exp)
points(dd$National_life_exp, col='red') # The copy from before.

enter image description here

This is great! Thanks! What is the `.SD` doing? – Dr. Beeblebrox Mar 25 '13 at 02:00 — Dr. Beeblebrox, Mar 25 '13 at 02:00
That is the `S`ubset of the `D`ata selected with `by`. – Matthew Lundberg Mar 25 '13 at 02:02 — Matthew Lundberg, Mar 25 '13 at 02:02

How to efficiently extrapolate missing data for multiple variables

1 Answers1

Linked