0

I have a dataframe, which contains dates of many years. It looks like this (instead of three years, I have 40 years):

DATES<-c(seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days'))
df<-data.frame(DATES)

for each day I want to add the season. Hereby, Spring should start at the 20th of March, Summer at the 21st of June, Autumn at the 23rd of September, Winter at the 21st of December. These dates stay unchanged over the years.

I came up with the following code, which works (at least, I think so). However, I was wondering, if there isn't are more elegant way, to get what I want.

df$MONTH<-month(df$DATES)
df$DAY<-mday(df$DATES)
df$DAY_PLUS_MONTH<-df$DAY+df$MONTH*100

df <- df %>%
  mutate(SEASON = case_when(
    DAY_PLUS_MONTH %in% 320:620 ~ 'SPRING',
    DAY_PLUS_MONTH %in% 621:922 ~ 'SUMMER',
    DAY_PLUS_MONTH %in% 923:1221 ~ 'AUTUMN',
    TRUE ~ 'WINTER'))
LuckyLuke
  • 11
  • 5
  • 1
    There are a bunch of solutions for this question [here](https://stackoverflow.com/questions/36502140/determine-season-from-date-using-lubridate-in-r) – Till Nov 03 '20 at 16:05

2 Answers2

2

I think this should work for you:

cut(lubridate::yday(df$DATES - lubridate::days(79)), 
    breaks = c(0, 93, 187, 276, Inf), 
    labels = c("Spring", "Summer", "Autumn", "Winter"))
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • I was about to suggest the same thing, using `as.POSIXlt(.)$yday`. But ... doesn't this differ a little on leap-years? – r2evans Nov 03 '20 at 16:10
  • 1
    @r2evans yes, and I did consider this: I guess my thinking was that the date limits are a little arbitrary - it's amazing that plants wait an extra day before growing when it's a leap year :) ... or maybe that the extra simplicity made up for the slight inaccuracy, or I may even just have been a bit lazy... Anyway - you get +1 for doing it properly! – Allan Cameron Nov 03 '20 at 18:05
  • 1
    I've become a corner-case junkie ever since I started working with partners who don't understand the significance of (e.g.) TZ when dealing with logging time-series data streams. It works great until it doesn't. This is certainly the more readable, and if the user is willing to accept the generality (and infrequent inconsistencies), then readability often trumps precision. – r2evans Nov 03 '20 at 20:16
1

Using $yday (whether from lubridate or as.POSIXlt) may give erroneous results for leap-years. I think a safer method is to create a vector of each of those dates across the years, adding one year in each direction (before/after).

I'm using findInterval, but it's close-enough to cut that you could use that method using the same variables here.

season_dates <- as.Date(sort(c(outer(
  do.call(seq.int, as.list(1900 + as.POSIXlt(range(df$DATES) + c(-365, 365))$year)), 
  c("-03-20", "-06-21", "-09-23", "-12-21"),
  paste0))))
season_dates
#  [1] "2016-03-20" "2016-06-21" "2016-09-23" "2016-12-21" "2017-03-20" "2017-06-21" "2017-09-23" "2017-12-21" "2018-03-20"
# [10] "2018-06-21" "2018-09-23" "2018-12-21" "2019-03-20" "2019-06-21" "2019-09-23" "2019-12-21" "2020-03-20" "2020-06-21"
# [19] "2020-09-23" "2020-12-21"
season_names <- rep(c("Spring", "Summer", "Autumn", "Winter"), length.out = length(season_dates))
season_names
#  [1] "Spring" "Summer" "Autumn" "Winter" "Spring" "Summer" "Autumn" "Winter" "Spring" "Summer" "Autumn" "Winter" "Spring"
# [14] "Summer" "Autumn" "Winter" "Spring" "Summer" "Autumn" "Winter"

set.seed(42)
as_tibble(df) %>%
  mutate(SEASON = season_names[ findInterval(DATES, season_dates) ]) %>%
  sample_n(10) %>%
  arrange(DATES)
# # A tibble: 10 x 2
#    DATES      SEASON
#    <date>     <chr> 
#  1 2017-01-24 Winter
#  2 2017-02-18 Winter
#  3 2017-06-14 Spring
#  4 2017-11-17 Autumn
#  5 2017-12-22 Winter
#  6 2018-02-14 Winter
#  7 2018-07-15 Summer
#  8 2018-09-14 Summer
#  9 2018-09-26 Autumn
# 10 2019-06-18 Spring

I sampled the output just to show some variance, as otherwise the first 10 results were all winter. Also, I used as.POSIXlt(.)$year and then had to adjust it since it is 1900-based. lubridate::year would work here, too.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • thanks! could you maybe explain to me, what the `set.seed(42)` is for? – LuckyLuke Nov 03 '20 at 17:46
  • It's solely because I used randomness in the display. If I didn't provide that and you used `sample_n(10)`, you would get a different result. By including `set.seed(42)`, you should see the same random sampling for display. For regular use, though, neither `set.seed` nor `sample_n` are needed. – r2evans Nov 03 '20 at 19:13