7

I would like to perform a linear interpolation in a variable of a data frame which takes into account the: 1) time difference between the two points, 2) the moment when the data was taken and 3) the individual taken for measure the variable.

For example in the next dataframe:

 df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
            Individuals=c(1,1,1,1,1,1,1,2,2,2),
            Value=c(1, 2, 3, NA, 5, NA, 7, 5, NA, 7))
  df

I would like to obtain:

 result <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
                Individuals=c(1,1,1,1,1,1,1,2,2,2),
                Value=c(1, 2, 3, 4, 5, 6, 7, 5, 5.5, 6))
 result

I cannot use exclusively the function na.approx of the package zoo because all observations are not consecutives, some observations belong to one individual and other observations belong to other ones. The reason is because if the second individual would have its first obsrevation with NA and I would use exclusively the function na.approx, I would be using information from the individual==1 to interpolate the NA of the individual==2 (e.g the next data frame would have sucherror)

  df_2 <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
                Individuals=c(1,1,1,1,1,1,1,2,2,2),
                Value=c(1, 2, 3, NA, 5, NA, 7, NA, 5, 7))
  df_2

I have tried using the packages zoo and dplyr:

library(dplyr)
library(zoo)
proof <- df %>%
  group_by(Individuals) %>%
  na.approx(df$Value)

But I cannot perform group_by in a zoo object.

Do you know how to interpolate NA values in one variable by groups?

Thanks in advance,

Ruben
  • 493
  • 4
  • 18
  • Could you please elaborate a bit on what your desired output would be? In addition, what is the 'Ear_tag' grouping you are trying for. This column doesn't exist in the dataframe you provided? – Pash101 Nov 13 '15 at 16:03

2 Answers2

10

Use data.frame, rather than cbind to create your data. cbind returns a matrix, but you need a data frame for dplyr. Then use na.approx inside mutate. I've commented out group_by, as you haven't provided the grouping variable in your data, but the approach should work once you've added the grouping variable to the data frame.

df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
            Individuals=c(1,1,1,1,1,1,1,2,2,2),
            Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))

library(dplyr)
library(zoo)

df %>%
  group_by(Individuals) %>%
  mutate(ValueInterp = na.approx(Value, na.rm=FALSE))    
   time Individuals Value ValueInterp
1     1           1    NA          NA
2     2           1     2           2
3     3           1     3           3
4     4           1    NA           4
5     5           1     5           5
6     6           1    NA           6
7     7           1     7           7
8     1           2     8           8
9     2           2    NA           9
10    3           2    10          10

Update: To interpolate multiple columns, we can use mutate_at. Here's an example with two value columns. We use mutate_at to run na.approx on all columns that include "Value" in the column name. list(interp=na.approx) tells mutate_at to generate new column names by running na.approx and adding interp as a suffix to generate the new column names:

df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
                 Individuals=c(1,1,1,1,1,1,1,2,2,2),
                 Value1=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10),
                 Value2=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10)*2)

df %>%
  group_by(Individuals) %>%
  mutate_at(vars(matches("Value")), list(interp=na.approx), na.rm=FALSE)
    time Individuals Value1 Value2 Value1_interp Value2_interp
   <dbl>       <dbl>  <dbl>  <dbl>         <dbl>         <dbl>
 1     1           1     NA     NA            NA            NA
 2     2           1      2      4             2             4
 3     3           1      3      6             3             6
 4     4           1     NA     NA             4             8
 5     5           1      5     10             5            10
 6     6           1     NA     NA             6            12
 7     7           1      7     14             7            14
 8     1           2      8     16             8            16
 9     2           2     NA     NA             9            18
10     3           2     10     20            10            20

If you don't want to preserve the original, uninterpolated columns, you can do:

df %>%
  group_by(Individuals) %>%
  mutate_at(vars(matches("Value")), na.approx, na.rm=FALSE)
eipi10
  • 91,525
  • 24
  • 209
  • 285
  • Great, It works for the example `df`, but it doesn't works when the `NA` is located in the first or last observation as in the example proposed in `df_2`. Is it possible that the code interpolates at least the data for the observations that aren't the first or last observation? – Ruben Nov 13 '15 at 16:23
  • 1
    Just add `na.rm=FALSE` (see updated code). Then leading and trailing `NA` values will be kept in the resulting vector. – eipi10 Nov 13 '15 at 16:26
  • And how can I perform the interpolation for the entire dataset if I have 1 or 2 observations for one individual and those measures are NAs? For example in this datafrate: `df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3), Individuals=c(3,3,1,1,1,1,1,2,2,2), Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))` I would like to get the interpolation for the rest of data, but when I applied this code R says:`Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate`. Is there any way to don't take into account the impossible cases? – Ruben Nov 16 '15 at 10:54
  • I want to do the same but for multiple columns, how should modify the code? – msh855 May 13 '19 at 22:18
5

We can use data.table

library(data.table)
library(zoo)
setDT(df1)[, ValueInterp:= na.approx(Value, na.rm=TRUE), by = Individual]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Yes it works in the cases were the `NA` is not located in the first or last obsrevation of one individual, but it doesn't work when the `NA` is the first or last observation. However the worst is that data.table provides a result suppling items (warning message) and performing bad the assignation what is dangerous if you run a big code where you are not able to check all warning messages. – Ruben Nov 13 '15 at 16:30
  • @Ruben For `df_2`, what is the expected output? – akrun Nov 13 '15 at 16:35
  • The output should have the interpolation of all `NAs` except which that are the first or last observation of one individual. – Ruben Nov 13 '15 at 16:39
  • 1
    @Ruben DId you mean that for `df_2`, you don't need the 8th row? By using `na.rm=TRUE`, we get all the NA's replaced except first – akrun Nov 13 '15 at 16:41
  • And how can I perform the interpolation for the entire dataset if I have 1 or 2 observations for one individual and those measures are NAs? For example in this datafrate: `df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3), Individuals=c(3,3,1,1,1,1,1,2,2,2), Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))` I would like to get the interpolation for the rest of data, but when I applied this code R says:`Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate`. Is there any way to don't take into account the impossible cases? Thanks – Ruben Nov 16 '15 at 10:55
  • @Ruben What would be the expected output (just to cross-reference)? – akrun Nov 16 '15 at 10:58
  • @Ruben Try `setDT(df)[, ValueInterp := if(length(na.omit(Value))<2) na.omit(Value) else na.approx(Value, na.rm=TRUE), Individuals]` – akrun Nov 16 '15 at 11:03
  • Firstly, thank you very much. But it interpolates a value for the first observation of the `Individual==3` when it shouldn't do it because it only has one value to do it. The output that I expect is one where the cases impossible to interpolate (because there are no previous and next data from the `NA`) mantain the `ValueInterp==NA`. With your updated code the first observation of the `Individuals==3` has `ValueInterp==2` and should be `NA`, I think that it is because the code has interpolated the `NA`with one value, the second observation of the `Individuals==3` the`Value==2` – Ruben Nov 16 '15 at 11:21
  • @Ruben Just like you showed the dataset `df`, please show the expected output to avoid confusion. – akrun Nov 16 '15 at 11:22
  • `expected_df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3), Individuals=c(3,3,1,1,1,1,1,2,2,2), Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10), ValueInterp=c(NA,2,3,4,5,6,7,8,9,10)) expected_df` – Ruben Nov 16 '15 at 11:24
  • 1
    @Ruben Try `setDT(df)[, ValueInterp := if(length(na.omit(Value))<2) Value else na.approx(Value, na.rm=TRUE), Individuals]` – akrun Nov 16 '15 at 11:30
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95231/discussion-between-ruben-and-akrun). – Ruben Nov 16 '15 at 11:42