4

I am working with a multi-year dataset that has columns for date (%Y-%m-%d) and daily values for several variables.
In R, how do I subset the data by a date range (i.e., June 29 +/- 5 days) but capture the data from all years?

               DATE        A       B      C
1996-06-10 12:00:00    178.0    24.1    1.7

1996-06-11 12:00:00    184.1    30.2    1.1

1996-06-12 12:00:00    187.2    29.4    1.8

1996-06-13 12:00:00    194.4    35.0    5.3

1996-06-14 12:00:00    200.3    35.9    1.5

1996-06-15 12:00:00    138.9    15.1    0.0

...
ChrisM
  • 1,576
  • 6
  • 18
  • 29
Mark
  • 43
  • 4

4 Answers4

4

1) Base R

Let yrs be all unique years in the data and targets be each of those years with the target's month and day. Then create dates which contains all dates within delta days of any value in targets. Note that sapply strips dates of its "Date" class but that does not matter since it is only subsequently used in %in% and that ignores the class. Finally subset DF down to those rows whose DATE is in dates. No packages are used.

# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5

DATE <- as.Date(DF$DATE)
yrs <- unique(format(DATE, "%Y"))
targets <- as.Date(paste(yrs, target, sep = "-"))
dates <- c(sapply(targets, "+", seq(-delta, delta)))
DF[DATE %in% dates, ]

giving:

                 DATE     A    B   C
5 1996-06-14 12:00:00 200.3 35.9 1.5
6 1996-06-15 12:00:00 138.9 15.1 0.0

2) sqldf

Alternately, this can be done using a single SQL statement. Note that we assume that the DATE column is character since the question referred to it being in a particular format. Now, using the same inputs the inner select generates target dates from each year and then the outer select joins DF to those rows within delta days of any target date. We use the H2 database backend here since it has better date support than SQLite.

library(sqldf)
library(RH2) 

# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5

fn$sqldf("select DF.* from DF
  join (select distinct cast(substr(DATE, 1, 4) || '-' || '$target' as DATE) as target 
        from DF) 
    on cast(substr(DATE, 1, 10) as DATE) between target - $delta and target + $delta")

giving:

                 DATE     A    B   C
1 1996-06-14 12:00:00 200.3 35.9 1.5
2 1996-06-15 12:00:00 138.9 15.1 0.0

We could simplify the SQL somewhat if DATE is of R's "Date" class. That is, replace the sqldf statement above with:

DF2 <- transform(DF, DATE = as.Date(DATE))
fn$sqldf("select DF2.* from DF2
  join (select distinct cast(year(DATE) || '-' || '$target' as DATE) as target from DF2) 
    on DATE between target - $delta and target + $delta")

giving:

        DATE     A    B   C
1 1996-06-14 200.3 35.9 1.5
2 1996-06-15 138.9 15.1 0.0

Note

The input DF is assumed to be:

DF <- structure(list(DATE = c("1996-06-10 12:00:00", "1996-06-11 12:00:00", 
"1996-06-12 12:00:00", "1996-06-13 12:00:00", "1996-06-14 12:00:00", 
"1996-06-15 12:00:00"), A = c(178, 184.1, 187.2, 194.4, 200.3, 
138.9), B = c(24.1, 30.2, 29.4, 35, 35.9, 15.1), C = c(1.7, 1.1, 
1.8, 5.3, 1.5, 0)), .Names = c("DATE", "A", "B", "C"), row.names = c(NA, 
-6L), class = "data.frame")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
3

A base R attempt.

Stealing the example data from the other answer by Kevin:

df <- data.frame(
  my_date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
  x = rnorm(3652),
  y = rnorm(3652),
  z = rnorm(3652)
)

Set your variables for the selection:

month_num <- 6
day_num <- 29
bound <- 5

Find the key dates in your range of years:

keydates <- as.Date(sprintf(
  "%d-%02d-%02d", 
  do.call(seq, as.list(as.numeric(range(format(df$my_date, "%Y"))))),
  month_num,
  day_num
))

Make a selection:

out <- df[df$my_date %in% outer(keydates, -bound:bound, `+`),]

Check that it worked:

table(format(out$my_date, "%m-%d"))
#06-24 06-25 06-26 06-27 06-28 06-29 06-30 07-01 07-02 07-03 07-04 
#   10    10    10    10    10    10    10    10    10    10    10

One valid value for each day/month for each year 1990 to 1999, centred on "06-29" with a range of 5 days either side

thelatemail
  • 91,185
  • 12
  • 128
  • 188
1

You can use intervals to provide valid date ranges and then use a map to run each interval over your data to filter.

library(dplyr)
library(lubridate)
library(magrittr)  # only because I've used the "exposition" (%$%) pipe
library(purrr)

df <- tibble(
  my_date = as.POSIXct(
    seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
    tz = "UTC"
  ),
  x = rnorm(3652),
  y = rnorm(3652),
  z = rnorm(3652)
)

month_num <- 6
day_num <- 29
bound <- 5

date_span <- df %>%
  select(my_date) %>%
  filter(month(my_date) == month_num & day(my_date) == day_num) %>%
  mutate(
    start = my_date - days(bound),
    end = my_date + days(bound)
  ) %$%
  interval(start, end, tzone = "UTC")

map_dfr(date_span, ~filter(df, my_date %within% .x))

# # A tibble: 110 x 4
#    my_date                   x       y       z
#    <dttm>                <dbl>   <dbl>   <dbl>
#  1 1990-06-24 10:00:00  0.404   1.33    1.58  
#  2 1990-06-25 10:00:00  0.351  -1.73    0.665 
#  3 1990-06-26 10:00:00 -0.512   1.01    1.72  
#  4 1990-06-27 10:00:00  1.55    0.417  -0.126 
#  5 1990-06-28 10:00:00  1.86    1.18    0.322 
#  6 1990-06-29 10:00:00 -0.0193 -0.105   0.356 
#  7 1990-06-30 10:00:00  0.844  -0.712   1.51  
#  8 1990-07-01 10:00:00 -0.431   0.451  -2.19  
#  9 1990-07-02 10:00:00  1.74   -0.0650 -0.866 
# 10 1990-07-03 10:00:00  0.965  -0.506  -0.0690
# # ... with 100 more rows
Kevin Arseneau
  • 6,186
  • 1
  • 21
  • 40
1

You could also go via the Julian day, which allows you to do basic arithmetic operations (e.g. ± 5 days) without the need to convert back and forth between Date and character objects. Keep in mind that your target date translates into a different Julian day during leap years, so you'll need to extract this piece of information somehow (use lubridate::leap_year if you don't like the base R approach below):

## convert dates to julian day
dat$JULDAY = format(
  dat$DATE
  , "%j"
)

## target date (here 19 june) as julian day
dat$TARGET = ifelse(
  as.integer(
    format(
      dat$DATE
      , "%y"
    )
  ) %% 4 == 0
  , 171 # leap year
  , 170 # common year
)

## create subset
subset(
  dat
  , JULDAY >= (TARGET - 5) & JULDAY <= (TARGET + 5)
  , select = c("DATE", "A", "B", "C")
)

#         DATE     A    B   C
# 5 1996-06-14 12:00:00 200.3 35.9 1.5
# 6 1996-06-15 12:00:00 138.9 15.1 0.0
fdetsch
  • 5,239
  • 3
  • 30
  • 58