2

I am trying to perform a very simple data completion : I have two columns of the same measurement made in two different locations a few miles away from each other. Location 1 is more complete than location 2, and I want to complete the second with the first, by applying a coefficient (loc1/loc2) to 1.

My problem is that it is about weather values, that vary with the day considered, so applying the same mean coefficient to all my values would not be ideal. I first got mean coefficients by the values of loc1 and the day, but for the days with no loc2 values, I would like to :

  • replace NA in 'coef' value by a coef calculated with a same loc1 value
  • but there are sometime more than one coef for each loc1 value, so when there is more than one, it would be preferentially the one measured on the same day
  • and when there is no coef value at all for one loc1 value, then coef would be the same than for that loc1 value +/- an interval (the coef of the closest loc1 value on the closest date to the one to be completed) Here is what it looks like :

      loc1 Date      Coef   loc2
    1      12 204        3      4
    2      8  147        4      2
    3      8  204        NA    NA
    4      10 147        NA    NA
    5      10 158        NA    NA
    6      6  159        3      2
    7      6  162        NA    NA
    8      6  170        2      3
    9      .3 175 0.4833333 0.145
    10     0.3 204        NA    NA
    11     0.4 146        NA    NA
    12     0.4 147        NA    NA
    

I am pretty sure there is a non messy way of doing that, but with my limited knowledge using ifelse or ddply, I didn't get any closer to what I want in the end. I feel a loop could do the trick, but I have no idea how..

Any thoughts would be greatly appreciated ! many thanks in advance !

1 Answers1

0

I don't understand exactly what you mean in bullet 3:

when there is no coef value at all for one loc1 value, then coef would be the same than for that loc1 value +/- an interval (the coef of the closest loc1 value on the closest date to the one to be completed)

So I don't have an answer for that part, but this might get you some of the way there:


library(dplyr)

df <- tibble::tribble(
  ~loc1, ~Date,     ~Coef, ~loc2,
  12,  204L,         3,     4,
  8,  147L,         4,     2,
  8,  204L,        NA,    NA,
  10,  147L,        NA,    NA,
  10,  158L,        NA,    NA,
  6,  159L,         3,     2,
  6,  162L,        NA,    NA,
  6,  170L,         2,     3,
  3,  175L, 0.4833333, 0.145,
  0.3,  204L,        NA,    NA,
  0.4,  146L,        NA,    NA,
  0.4,  147L,        NA,    NA
)

df %>% 
  # Replace Coef with the coef of same loc1 and same day
  group_by(loc1, Date) %>% 
  mutate(Coef = if_else(!is.finite(Coef), mean(Coef, na.rm = TRUE), Coef)) %>% 
  # For ones without same day and loc1, use the average of all days at loc1
  group_by(loc1) %>% 
  mutate(Coef = if_else(!is.finite(Coef), mean(Coef, na.rm = TRUE), Coef)) %>% 
  ungroup() %>% 
  # Then complete the loc2 with using the completed Coef and loc1
  mutate(loc2 = if_else(!is.finite(loc2), loc1 * Coef, loc2))

#> # A tibble: 12 x 4
#>     loc1  Date      Coef   loc2
#>    <dbl> <int>     <dbl>  <dbl>
#>  1  12.0   204 3.0000000  4.000
#>  2   8.0   147 4.0000000  2.000
#>  3   8.0   204 4.0000000 32.000
#>  4  10.0   147       NaN    NaN
#>  5  10.0   158       NaN    NaN
#>  6   6.0   159 3.0000000  2.000
#>  7   6.0   162 2.5000000 15.000
#>  8   6.0   170 2.0000000  3.000
#>  9   3.0   175 0.4833333  0.145
#> 10   0.3   204       NaN    NaN
#> 11   0.4   146       NaN    NaN
#> 12   0.4   147       NaN    NaN
austensen
  • 2,857
  • 13
  • 24
  • Many thanks ! What I meant with my third bullet was something to complete the missing coef with the closest value possible in terms of date and loc1. Take row 4, or 5 in my example : no similar value of loc1, no similar value of Date. So I would have taken the coef from row 2, because 8 is the closest value to 10 in loc1, and 147 is closer to 147 and 158 than 204 for the date, for example. But that leads to many problems about choosing the limits for how different you tolerate loc1 to be, or if there are several possibilities ..So your method with the average loc1 is adopted ! – Caspa Letti May 30 '17 at 18:16
  • Great, glad this was helpful! – austensen May 30 '17 at 18:30