1

I am trying to import in R a text file including datetimes. Times are stored in character format, without timezone information, but we know it is French time (Europe/Paris).

An issue arise for the days of timezone change: e.g. there is a time change from 2018-10-28 03:00:00 CEST to 2018-10-28 02:00:00 CET, thus we have duplicates in our character format, and R cannot tell wether it is CEST or CET.

Consider the following example:

data_in <- "date,val
2018-10-28 01:30:00,25
2018-10-28 02:00:00,26
2018-10-28 02:30:00,27
2018-10-28 02:00:00,28
2018-10-28 02:30:00,29
2018-10-28 03:00:00,30"

library(readr)
data <- read_delim(data_in, ",", locale = locale(tz = "Europe/Paris"))

We end up having duplicates in our dates:

data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CET"  "2018-10-28 02:00:00 CEST"
[5] "2018-10-28 02:30:00 CET"  "2018-10-28 03:00:00 CET" 

Expected output would be:

data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST"  "2018-10-28 02:00:00 CET"
[5] "2018-10-28 02:30:00 CET"  "2018-10-28 03:00:00 CET" 

Any idea how to solve the issue (besides telling people to use UTC or ISO formats). I guess the only way is to suppose the dates are sorted, so we can tell the first ones are CEST.

d.b
  • 32,245
  • 6
  • 36
  • 77
jlesuffleur
  • 1,113
  • 1
  • 7
  • 19

1 Answers1

1

If you are certain that your time is always-increasing, then you can look for an apparent decrease (of time-of-day) and manually insert the TZ offset to the string, then parse as usual. I added some logic to look for this decrease only around 2-3am so that if you have multiple days of data spanning midnight, you would not get a false-alarm.

data <- read.csv(text = data_in)
fakedate <- as.POSIXct(gsub("^[-0-9]+ ", "2000-01-01 ", data$date))
decreases <- cumany(grepl(" 0[23]:", data$date) & c(FALSE, diff(fakedate) < 0))
data$date <- paste(data$date, ifelse(decreases, "+0100", "+0200"))
data
#                        date val
# 1 2018-10-28 01:30:00 +0200  25
# 2 2018-10-28 02:00:00 +0200  26
# 3 2018-10-28 02:30:00 +0200  27
# 4 2018-10-28 02:00:00 +0100  28
# 5 2018-10-28 02:30:00 +0100  29
# 6 2018-10-28 03:00:00 +0100  30

as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S %z", tz="Europe/Paris")
# [1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST"
# [4] "2018-10-28 02:00:00 CET"  "2018-10-28 02:30:00 CET"  "2018-10-28 03:00:00 CET" 

My use of "2000-01-01" was just some non-DST day so that we can parse the timestamp into POSIXt and calculate a diff on it. (If we didn't insert a date, we could still use as.POSIXct with a format, but if you ever ran this on one of the two DST days, you might get different results since as.POSIXct("01:02:03", format="%H:%M:%S") always assumes "today".

This is obviously a bit fragile with its assumptions, but perhaps it'll be good enough for what you need.

r2evans
  • 141,215
  • 6
  • 77
  • 149