3

I have panel data and numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate for missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA values in the "National GDP" and "National Life Expectancy" variables using a linear trend calculated with the observed data points in each column.

###Simulate National GDP values
set.seed(42)
nat_gdp <- c(replicate(20L, {
  foo <- rnorm(3, mean = 2000, sd = 300) + c(0,1000,2000) 
  c(NA,NA,foo)}))
###Simulate national life expectancy values

nat_life <- c(replicate(20L, {
  foo <-  rnorm(2, mean = 55, sd = 7.8) + c(0,1.5)
  c(NA,NA,NA,foo)}))




###Construct the data.table       
data.sim <- data.table(  GovernorateID = c(rep(seq.int(11L,15L,by=1L), each = 20)), 
                         DistrictID =rep(seq.int(1100,1500,by=100),each=20 ) + rep(seq_len(4), each = 5), 
                         Year = seq.int(1990,1994,by=1L),
                         National_gdp =  nat_gdp   , 
                         National_life_exp =    nat_life  )
Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Dr. Beeblebrox
  • 838
  • 2
  • 13
  • 30
  • Have you looked into multiple imputation? See 'mice' package, or other related packages. Also, this doesn't seem like a programming question as much as a statistics question. – ndoogan Mar 25 '13 at 00:22
  • Never grow vectors within loops, preallocate or use `replicate`. I've edited to show a simpler and more efficient approach – mnel Mar 25 '13 at 00:32
  • @ndoogan For the years with missingness, I'm missing observations across most variables, so multiple imputation is not an option. That's why I'm looking for simple linear extrapolation. – Dr. Beeblebrox Mar 25 '13 at 00:34
  • @mnel Thank you. Why never grow vectors within loops? – Dr. Beeblebrox Mar 25 '13 at 00:35
  • Because every time you reassign, you copy the whole thing. It is in the [2nd circle of R hell](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) – mnel Mar 25 '13 at 00:38
  • I had never seen `L` before. In case anyone else wants to know, I found this helpful SO discussion: http://stackoverflow.com/questions/22191324/clarification-of-l-in-r – Dr. Beeblebrox Mar 18 '14 at 15:24

1 Answers1

4

I assume that you want to do the linear model on each DistrictID separately.

Original data table:

head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990           NA                NA
## 2:            11       1101 1991           NA                NA
## 3:            11       1101 1992     1988.746                NA
## 4:            11       1101 1993     2527.619          54.70739
## 5:            11       1101 1994     3854.210          44.21809
## 6:            11       1102 1990           NA                NA

dd <- copy(data.sim) # Make a copy for later.

Replace NA elements in each with the prediction of a linear model. Two steps in one chained operation.

data.sim[, National_life_exp := ifelse(is.na(National_life_exp), 
                                       predict(lm(National_life_exp ~ Year, data=.SD), .SD),
                                       National_life_exp)
         , by=DistrictID
         ][, National_gdp := ifelse(is.na(National_gdp),
                                    predict(lm(National_gdp ~ Year, data=.SD), .SD),
                                    National_gdp) 
           , by=DistrictID
        ]


head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990    -8.004377          86.17531
## 2:            11       1101 1991   924.727559          75.68601
## 3:            11       1101 1992  1988.745871          65.19670
## 4:            11       1101 1993  2527.618676          54.70739
## 5:            11       1101 1994  3854.209743          44.21809
## 6:            11       1102 1990  1008.886661          70.45643

A nice (?) plot. Note that each level of DistrictID has exactly two non-NA points in this example.

plot(data.sim$National_life_exp)
points(dd$National_life_exp, col='red') # The copy from before.

enter image description here

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112