8

My data frame consists of individual observations of individual animals. Each animal has a birthdate, that I would like to associate to the closest field season date from a date vector.

Here is a very basic reproducible example:

ID <- c("a", "b", "c", "d", "a") # individual "a" is measured twice here
birthdate <- as.Date(c("2012-06-12", "2014-06-14", "2015-11-11", "2016-09-30", "2012-06-12"))    
df <- data.frame(ID, birthdate)

# This is the date vector
season_enddates <- as.Date(c("2011-11-10", "2012-11-28", "2013-11-29", "2014-11-26", "2015-11-16", "2016-11-22", "2012-06-21", "2013-06-23", "2014-06-25", "2015-06-08", "2016-06-14"))

With the following code, I can get the difference between the birthdate and the closest season enddate.

for(i in 1:length(df$birthdate)){
  df$birthseason[i] <- which(abs(season_enddates-df$birthdate[i]) == min(abs(season_enddates-df$birthdate[i])))
}

However, what I want is the actual date, and not the difference. For example, the first value of birthseason should be 2012-06-21.

10 Rep
  • 2,217
  • 7
  • 19
  • 33
Mehdi.K
  • 371
  • 4
  • 15
  • Are you using the `lubridate` package? – samkart Jul 13 '17 at 13:17
  • Would you mind editing the post so the object names in the second part (for loop) use the names in the first part of the example. It's unclear where `younger$HatchCalendarYear1` comes from, for example. – Damian Jul 13 '17 at 13:25
  • @Damian I suggested the edits you mention, have a look – JanLauGe Jul 13 '17 at 13:31

5 Answers5

4

It's a bit confusing since you use variables which you didn't include in your examples.

But I think this is what you want:

for (ii in 1:nrow(df))  df$birthseason[ii] <-as.character(season_enddates[which.min(abs(df$birthdate[ii] - season_enddates))])

Alternatively using lapply:

df$birthseason <- unlist(lapply(df$birthdate,function(x) as.character(season_enddates[which.min(abs(x - season_enddates))])))

Result:

> df
  ID  birthdate birthseason
1  a 2012-06-12  2012-06-21
2  b 2014-06-14  2014-06-25
3  c 2015-11-11  2015-11-16
4  d 2016-09-30  2016-11-22
5  a 2012-06-12  2012-06-21
Val
  • 6,585
  • 5
  • 22
  • 52
2

You are looking for which season_enddate is the closest to birthdate[1], and birthdate[2], etc.

To get the data straight, I will create an actual reproducible example:

birthdate <- as.Date(c("2012-06-12", "2014-06-14", 
                       "2015-11-11", "2016-09-30", 
                       "2012-06-12"))

season_enddates <- as.Date(c("2011-11-10", "2012-11-28", 
                             "2013-11-29", "2014-11-26",
                             "2015-11-16", "2016-11-22", 
                             "2012-06-21", "2013-06-23", 
                             "2014-06-25", "2015-06-08", 
                             "2016-06-14"))

Basically I use the function you also used, except that I decided to break it down a bit, so it's easier to follow what you're trying to do:

new.vector <- rep(0, length(birthdate))
for(i in 1:length(birthdate)){
    diffs <- abs(birthdate[i] - season_enddates)
    inds  <- which.min(diffs)
    new.vector[i] <- season_enddates[inds]
}

# new.vector now contains some dates that have been converted to numbers:
as.Date(new.vector, origin = "1970-01-01")
# [1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22"
# [5] "2012-06-21"
KenHBS
  • 6,756
  • 6
  • 37
  • 52
  • code looks a lot cleaner here. Maybe edit the question to use the same layout? Makes it much easier to follow. – JanLauGe Jul 13 '17 at 13:41
  • Sorry, I must have pasted a part of the code I used on my real dataset by accident... It's fixed now. – Mehdi.K Jul 13 '17 at 23:09
2

All solutions here are essentially the same. If you want to have an optimized function doing this operation for you, this is how I'd do it:

match_season <- function(x,y){
  nx <- length(x)
  ind <- numeric(nx)
  for(i in seq_len(nx)){
    ind[i] <- which.min(abs(x[i] - y))
  }
  y[ind]
}

Then you can simply do:

younger$birthseason <- match_season(younger$HatchDate, season_enddates)

Looks cleaner and gives you the desired output in the correct Date format.

Benchmarking:

start <- as.Date("1990-07-01")
end <- as.Date("2017-06-30")

birthdate <- sample(seq(start, end, by = "1 day"), 1000)

season_enddates <- seq(as.Date("1990-12-21"),
                       as.Date("2017-6-21"),
                       by = "3 months")

library(rbenchmark)

benchmark(match_season(birthdate, season_enddates),
          columns = c("test","elapsed"))

gives a timing of 7.62 seconds for 100 replications.

Joris Meys
  • 106,551
  • 31
  • 221
  • 263
1

I have suggested some edits to your question, so that your example code produces all variables required to reproduce your problem. Please have a look and check that I understood your problem.

To solve it, i suggest to use which.min (keeps your code a bit simpler and faster), in combination with subsetting of your season_enddates vector, as shown below:

for(i in 1:length(younger$HatchCalendarYear)){
  df$birthseasonDate[i] <- season_enddates[which.min(abs(season_enddates - df$birthdate[i]))]
}
JanLauGe
  • 2,297
  • 2
  • 16
  • 40
1

findInterval is useful in such cases. Locating the nearest season_enddates for each df$birthdate:

vec = sort(season_enddates)
int = findInterval(df$birthdate, vec, all.inside = TRUE)
int
#[1]  1  5  8 10  1

we compare the distance from each of the surrounding dates of the interval and select the minimum:

ans = vec[int]
i = abs(df$birthdate - vec[int]) > abs(df$birthdate - vec[int + 1])
ans[i] = vec[int[i] + 1]
ans
#[1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22" "2012-06-21"
alexis_laz
  • 12,884
  • 4
  • 27
  • 37