14

As a R novice I'm pulling my hair out trying to debug cryptic R errors. I have csv that containing 150k lines that I load into a data frame named 'date'. I then use lubridate to convert this character column to datetimes in hopes of finding min/max date.

  dates <- csv[c('datetime')]
  dates$datetime <- ymd_hms(dates$datetime)

Running this code I receive the following error message:

Warning message:
3 failed to parse. 

I accept this as the CSV could have some janky dates in there and next run:

min(dates$datetime) 
max(dates$datetime)

Both of these return NA, which I assume is from the few broken dates still stored in the data frame. I've searched around for a quick fix, and have even tried to build a foreach loop to identify the problem dates, but no luck. What would be a simple way to identify the 3 broken dates?

example date format: 2015-06-17 17:10:16 +0000
Korben Dallas
  • 477
  • 1
  • 6
  • 16
  • You can check whether the format is consistent. Also check `?guess_formats` – akrun Feb 12 '16 at 18:42
  • 1
    If you sort on the datetime vector, would the NA's be either at the head or the tail? – lawyeR Feb 12 '16 at 18:55
  • @akrun I attempted to do this in excel by import csv and checking the length of each cell and filtering for anything out of place. This didn't result in anything, is there a way to do this in R? I've attempted guess_formats & parse_date_time and both have failed. – Korben Dallas Feb 12 '16 at 19:07
  • @lawyeR That was it! I sorted the unparsed raw csv, then tail(dates) displayed the three incorrect fields. Thanks for the help! – Korben Dallas Feb 12 '16 at 19:09
  • 4
    You could also find the row numbers of the missing dates with `which(is.na(dates$datetime))`. – Stibu Feb 12 '16 at 19:10
  • @Stibu this is exactly what I was looking for, quick and easy. – Korben Dallas Feb 12 '16 at 19:11

6 Answers6

13

Credit to LawyeR and Stibu from above comments:

  1. I first sorted the raw csv column and did a head() & tail() to find which 3 dates were causing trouble
  2. Alternatively which(is.na(dates$datetime)) was a simple one liner to also find the answer.
Korben Dallas
  • 477
  • 1
  • 6
  • 16
  • 18
    This is great, but doesn't really answer the general question. What if the problem is that the character 'purpleElephant' is in your data? It's not an NA yet is still unparseable. We still need some way to view the warnings that are given by Lubridate. – Monica Heddneck Jan 06 '17 at 00:27
  • The question was about identifying the three broken dates and this accomplishes that perfectly. – Jon Apr 13 '17 at 16:53
  • 12
    But it only accomplishes that because the 3 dates happened to be NAs. I have a vector of 93 dates/datetimes which contains ~17 NAs and am getting "2 failed to parse". So this solution doesn't solve the generic problem, just the problem in OP's case. – dez93_2000 Oct 31 '19 at 18:11
4

Lubridate will throw that error when attempting to parse dates that do not exist because of daylight savings time.

For example:

library(lubridate)
mydate <- strptime('2020-03-08 02:30:00', format = "%Y-%m-%d %H:%M:%S")
ymd_hms(mydate, tz = "America/Denver")

[1] NA
Warning message:
 1 failed to parse. 

My data comes from an unintelligent sensor which does not know about DST, so impossible (but correctly formatted) dates appear in my timeseries.

Eric Krantz
  • 1,854
  • 15
  • 25
2

Here is a simple function that solves the generic problem:

parse_ymd = function(x){
  d=lubridate::ymd(x, quiet=TRUE)
  errors = x[!is.na(x) & is.na(d)]
  if(length(errors)>0){
    cli::cli_warn("Failed to parse some dates: {.val {errors}}")
  }
  d
}

x = c("2014/20/21", "2014/01/01", NA, "2014/01/02", "foobar")
my_date = lubridate::ymd(x)
#> Warning: 2 failed to parse.
my_date = parse_ymd(x)
#> Warning: Failed to parse some dates: "2014/20/21" and "foobar"

Created on 2022-09-29 with reprex v2.0.2

Of course, replace ymd() with whatever you want.

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92
  • It eliminates the failure warning but it does not show where the failure happens if there is a lot of data (I changed options(max.print = .Machine$integer.max) to avoid printing limitation however still the message does not appear ). It solved my problem. – yav dat Feb 23 '23 at 07:46
1

If the indices of where lubridate fails are useful to know, you can use a for loop with stopifnot() and print each successful parse.

Make some dates, throw an error in there at a random location.

library(lubridate)
set.seed(1)
my_dates<-as.character(sample(seq(as.Date('1900/01/01'), 
as.Date('2000/01/01'), by="day"), 1000))
my_dates[sample(1:length(my_dates), 1)]<-"purpleElephant"

Now use a for loop and print each successful parse with stopifnot().

for(i in 1:length(my_dates)){
   print(i)
   stopifnot(!is.na(ymd(my_dates[i])))
}

jbz
  • 141
  • 2
  • 7
1

To provide a more generic answer, first filter out the NAs, then try and parse, then filter only the NAs. This will show you the failures. Something like:

dates2 <- dates[!is.na(dates2$datetime)]
dates2$datetime <- ymd_hms(dates2$datetime)

Warning message:
 3 failed to parse.

dates2[is.na(dates2$datetime)]
Tom
  • 4,860
  • 7
  • 43
  • 55
0

Use the truncate argument. The most common type of irregularity in date-time data is the truncation due to rounding or unavailability of the time stamp.

Therefore, try truncated = 1, then potentially go up to truncated = 3:

  dates <- csv[c('datetime')]
  dates$datetime <- ymd_hms(dates$datetime, truncated = 1)