1

I have three vectors of dates in POSIX format that correspond with data collection times from three large datasets. Each of these vectors is of a different length and have similar (but not identical) dates.

I would like to:

  1. group these dates into specified time ranges, e.g. group dates from each vector that fall in a 30-day window and
  2. reduce the number of date groupings to reflect the dataset with the smallest number of collection times, e.g. if "dataset A" has three sampling dates and "dataset B" has five sampling dates then there would only be three groupings of dates (unless the two extra dates in "dataset B" fall within 30 days of the dates in "dataset A").

An example with three vectors of dates in POSIX format (I want to group similar dates between the vectors, allowing a time window of 30 days):

A.dates = as.POSIXlt(c("1998-07-24 08:00","1999-07-24 08:00","2000-07-24 08:00"), 
                     tz = "America/Los_Angeles")
B.dates = as.POSIXlt(c("1998-07-25 08:00","1999-07-25 08:00","2000-07-25 08:00"), 
                     tz = "America/Los_Angeles")
C.dates = as.POSIXlt(c("1998-07-26 08:00","1999-07-26 08:00","2000-07-26 08:00","2000-08-29"), 
                     tz = "America/Los_Angeles") 

Specifying a time window of 30 days, there would be three date groupings (the sampling dates from July 1998, 1999, and 2000). The C.dates vector has a fourth collection date of August 29th, 2000 which would be excluded from the groupings because:

  1. it is not within 30 days of the July dates in the other vectors and
  2. there are no dates in the other two vectors that fall within 30 days of August 29th, 2000.
jay.sf
  • 60,139
  • 8
  • 53
  • 110

1 Answers1

1

You could loop over each element of each vector and create sequences ± 15 days

L <- list(A.dates, B.dates, C.dates)
tmp <- lapply(L, function(x) lapply(x, function(x) 
  do.call(seq, c(as.list(as.Date(x) + c(-15, 15)), "day"))))

and unionite them in the list.

tmp <- lapply(tmp, function(x) as.Date(Reduce(union, x), origin="1970-01-01"))

Then simply find the intersect

i <- Reduce(function(...) as.Date(intersect(...), origin="1970-01-01"), tmp)

and select the dates accordingly.

tmp <- lapply(L, function(x) x[as.Date(x) %in% i])
tmp
# [[1]]
# [1] "1998-07-24 08:00:00 PDT" "1999-07-24 08:00:00 PDT"
# [3] "2000-07-24 08:00:00 PDT"
# 
# [[2]]
# [1] "1998-07-25 08:00:00 PDT" "1999-07-25 08:00:00 PDT"
# [3] "2000-07-25 08:00:00 PDT"
# 
# [[3]]
# [1] "1998-07-26 PDT" "1999-07-26 PDT" "2000-07-26 PDT"

To sort them by year according to your comment, we first unlist them. Unfortunately this converts the dates into numerics (i.e. seconds since January, 1 1970), so we need to convert them back.

tmp <- as.POSIXlt(unlist(lapply(tmp, as.POSIXct)), origin="1970-01-01",
                tz="America/Los_Angeles")

Finally we split the list by the first four substrings which is the year (we also could do split(tmp, strftime(tmp, "%Y")) though).

res <- split(tmp, substr(tmp, 1, 4))
res
# $`1998`
# [1] "1998-07-24 08:00:00 PDT" "1998-07-25 08:00:00 PDT"
# [3] "1998-07-26 00:00:00 PDT"
# 
# $`1999`
# [1] "1999-07-24 08:00:00 PDT" "1999-07-25 08:00:00 PDT"
# [3] "1999-07-26 00:00:00 PDT"
# 
# $`2000`
# [1] "2000-07-24 08:00:00 PDT" "2000-07-25 08:00:00 PDT"
# [3] "2000-07-26 00:00:00 PDT"
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 1
    how could I make the final output into lists of the date groupings, i.e. the 1998 dates would be one list, the 1999 dates would be the next list, etc.? Thanks! Everything so far has worked perfectly – eabrown2378 Dec 18 '20 at 18:05
  • @eabrown2378 Please see edits. Note, that I renamed some objects only needed once to `tmp` to not to clutter the workspace too much. – jay.sf Dec 18 '20 at 18:27
  • Is there a way to group them by the ± 15 days criteria instead of year. Thinking about situations where we have multiple sampling dates in a year, but not within 15 days, i.e. we would need two lists for 1998 – eabrown2378 Dec 18 '20 at 18:35
  • @eabrown2378 I understand, but it's tricky because the vectors in the ±15 list do overlap. – jay.sf Dec 18 '20 at 18:50
  • @eabrown2378 Anyway, the code would be something like `lapply(criteria15, function(x) tmp[as.Date(tmp) %in% x])` (the last `tmp`). – jay.sf Dec 18 '20 at 18:58