R: Extrapolating x no. of values beyond known values

Question

I'm looking for a function/method to linearly extrapolate a given number (x) of values beyond the original values.

Let's say I start with:

a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)

If I want to extrapolate two values beyond each end, I would end up with:

[1] NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA

What I found so far is the approxExtrap function from Hmisc (https://rdrr.io/cran/Hmisc/man/approxExtrap.html). But since you have to define 'xout', I feel that I have to write a loop that selects, one at a time, each piece I want to extrapolate from. This is possible of course, but ultimately I expect to have sequences of millions of datapoints with a lot of gaps, so I feel this may be too time-consuming. So I hope I'm overlooking a simpler solution.

Added: There are no small gaps in the data; typically there are ~100 NAs followed by ~40 datapoints. I would like to extend each run of 40 datapoints with 5 new datapoints before its start and 5 after its end, replacing 5 NAs at both locations. It is not possible to interpolate between two runs of 40 datapoints.

Are the extrapolation steps always just +/- 1? Do you have multiple runs of non-NA values in the same vector? — MrFlick, Dec 20 '21 at 20:39
Unfortunately it's not always just +/- 1, let's say: +/- 1 (sd: ~ 0.3). I have multiple runs of NAs indeed (hundreds in one day of data). — Jeroen, Dec 20 '21 at 20:49
The mean change of sequential datapoints is +1 with a standard deviation of ~ 0.3. For context: it's animal movement, so there's some level of autocorrelation, but there can also be a change in state (e.g. change from fast to slow movement). And the animals are only detected in a part of their environment, hence gaps with NAs. — Jeroen, Dec 20 '21 at 21:51
So you have time, x coord, and y coord? And you want to fill in the coords when there is NA based on the time and the previous/next coords? Perhaps something like https://rmisstastic.netlify.app/ will help — jared_mamrot, Dec 20 '21 at 22:32
@jared_mamrot indeed. But the gaps in the data are typically too large to interpolate (the animals may have moved in circles etc). But it would help a lot (for ID'ing individual tracks) to extend/extrapolate pieces of data a bit before the start and after the end. I will check out the link! — Jeroen, Dec 20 '21 at 22:49
What should happen if you have `c(1, 2, 3, NA, 10, 20, 30)`? Should the `NA` be replaced with 4 or 0? Would you prefer to *in*terpolate in this situation? You should specify in your question a complete set of rules that answers should follow. — Mikael Jagan, Dec 21 '21 at 06:11
Thanks for thinking along @Mikael Jagan, I added some details. Hope it's clear now. — Jeroen, Dec 21 '21 at 08:50

Jeroen · Accepted Answer · 2021-12-24T10:52:43.087

I managed to solve the problem by:

1. Determining the ranges of the different series of data
2. Defining the range I want to extrapolate to
3. Doing the actual extrapolation with the Hmisc package
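
As a side note, the first step (finding where each run of non-NA values starts and ends) can also be sketched with base R's rle(). This is a minimal sketch and not the script from this answer; the helper name find_value_runs is made up for illustration.

```r
# Hypothetical helper (not part of the original script): locate runs of
# consecutive non-NA values with run-length encoding.
find_value_runs <- function(x) {
  r <- rle(!is.na(x))             # runs of TRUE (value) / FALSE (NA)
  ends <- cumsum(r$lengths)       # last index of every run
  starts <- ends - r$lengths + 1  # first index of every run
  data.frame(start = starts[r$values], end = ends[r$values])
}

# Same test vector as in the script below
x <- c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20))
runs <- find_value_runs(x)
print(runs)  # two runs: rows 11 to 40 and rows 71 to 80
```

Because rle() works on the whole vector at once, this avoids the separate branches for data that starts with NA versus data that starts with a value.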

Initially, I thought I could only manage this by some loops that had to go through the raw data row by row, and was hoping for an existing function.

I'm sure many of you would have coded this much more efficiently and elegantly, but I wanted to post my script anyway for people with a similar problem.

library(Hmisc)
extrapol.length <- 5
# 'Time' is not used below because the data are equally spaced in time; to use
# it, pass it as the first argument of approxExtrap() in the extrapolation loop.
test <- data.frame('Time' = c(1:100),
                   'x' = c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20)))

## Determine start and end of the continuous (non-NA) data streams
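length.values <- diff(c(0, which(is.na(test[,2]))))-2 # length of each non-NA run, minus 1
length.values <- length.values[length.values > -1]    # drop entries that come from consecutive NAs
length.nas <- diff(c(0, which(!is.na(test[,2])))) # length of each NA run, plus 1
length.nas <- length.nas[length.nas > 1]          # drop entries that come from consecutive values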
if(is.na(test[1,2])){
  # data starts with NA
  length.nas <- data.frame('Order' = seq(1, length(length.nas)*2, by = 2),
                           'Length' = length.nas, 'Type' = 'na')
  length.values <- data.frame('Order' = seq(2, length(length.values)*2, by = 2),
                              'Length' = length.values, 'Type' = 'value')
  start.end <- rbind(length.nas, length.values)
  
  start.end <- start.end[order(start.end$Order),]
  
  value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'na'])),
                           'start' = NA, 'end' = NA)
  for(a in value.seqs$no){
    value.seqs$start[a] <- sum(start.end$Length[1:((a*2)-1)])
    value.seqs$end[a] <- sum(start.end$Length[1:(a*2)])
  }
}else{
  # Data starts with actual values
  length.nas <- data.frame('Order' = seq(2, length(length.nas)*2, by = 2),
                           'Length' = length.nas, 'Type' = 'na')
  length.values <- data.frame('Order' = seq(1, length(length.values)*2, by = 2),
                              'Length' = length.values, 'Type' = 'value')
  start.end <- rbind(length.nas, length.values)
  
  start.end <- start.end[order(start.end$Order),]
  
  value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'value'])),
                           'start' = c(1,rep(NA, (length(start.end$Type[start.end$Type == 'value'])-1))), 'end' = NA)
  for(a in value.seqs$no){
    value.seqs$end[a] <- sum(start.end$Length[1:((a*2)-1)])+1
    if(a < max(value.seqs$no))
      value.seqs$start[a+1] <- sum(start.end$Length[1:(a*2)])+1
  }
}

## Do not extrapolate outside of the time-range of the original dataframe
value.seqs$start.extr <- value.seqs$start - extrapol.length
value.seqs$start.extr[value.seqs$start.extr < 1] <- 1 # do not extrapolate below time < 1
value.seqs$end.extr <- value.seqs$end + extrapol.length
value.seqs$end.extr[value.seqs$end.extr > nrow(test) | is.na(value.seqs$end.extr)] <- nrow(test)
value.seqs$end[is.na(value.seqs$end)] <- max(which(!is.na(test[,2])))


## Extrapolate 
for(b in value.seqs$no){
  known <- value.seqs$start[b]:value.seqs$end[b]            # rows with observed values
  target <- value.seqs$start.extr[b]:value.seqs$end.extr[b] # rows to fill, including the extension
  # Store the result in a new third column of 'test'
  test[target, 3] <- approxExtrap(known, test[known, 2], xout = target)$y
}
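
For comparison, the three steps can be condensed into a short base R function. This is only a sketch under a few assumptions: observations are equally spaced, the slope is taken from the two points nearest each end of a run, and the name extend_runs is made up for illustration. It does not call approxExtrap, so results may differ from the script above.

```r
# Hypothetical helper (not from the original answer): extend every run of
# non-NA values by up to k points on each side, using the slope of the two
# points nearest that end. Assumes equally spaced observations.
extend_runs <- function(x, k) {
  r <- rle(!is.na(x))
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  out <- x
  for (i in which(r$values)) {
    s <- starts[i]
    e <- ends[i]
    if (e - s < 1) next  # a single point gives no slope to extrapolate from
    left <- s - seq_len(min(k, s - 1))           # indices before the run
    right <- e + seq_len(min(k, length(x) - e))  # indices after the run
    left <- left[is.na(x[left])]                 # never overwrite observed data
    right <- right[is.na(x[right])]
    out[left] <- x[s] + (left - s) * (x[s + 1] - x[s])
    out[right] <- x[e] + (right - e) * (x[e] - x[e - 1])
  }
  out
}

# Example from the question: extend the run 1, 2, 3 by two values on each side
a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
print(extend_runs(a, 2))
# NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA
```

The loop runs once per run of values rather than once per row, which should help with long vectors. If a gap is shorter than 2 * k, the extensions of neighbouring runs overlap and the later run overwrites the earlier one.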

Thanks for thinking along!
