Parsing dates with different formats

Question

I have a csv file with a column containing dates, but the dates are in two different formats: "m/d/y H:M" and "y m d H:M:S". I want to make a new column with these dates all in one format (I don't care which one). I tried the parse_date_time function but it would only work for one of the formats, not both. How can I go about doing this?

Here is the code I was trying to use:

newdata <- mutate(data,
                        newcolumn = parse_date_time(x = data$date_column,
                                                        orders = c("m/d/y H:M", "y m d H:M:S"),
                                                        locale = "eng") )

Here are some example dates from the column:

x <- c("6/21/2006 0:00",
       "1889-06-13 00:00:00",
       "6/28/2012 0:00",
       "5/19/2015 0:00",
       "6/6/2016 0:00",
       "1884-05-24 00:00:00",
       "7/28/2013 0:00")

can you post the column? it will make it easier to answer and make sure the answer actually works for you — De Novo, Mar 20 '18 at 17:36
@DanHall I just added some sample dates from the column, it's a very large data set so I can't post it all — Melissa, Mar 20 '18 at 17:43
For future reference, use `dput(x)`, where `x` is a small subset of your object that reproduces the problem and has all of the characteristics of your object. Copy and paste the output of `dput` as part of your question. But I think I've provided an answer for you below that should work. — De Novo, Mar 20 '18 at 17:49
OK, NOW it works. That was trickier than I thought it would be :) — De Novo, Mar 20 '18 at 18:09

score 3 · Answer 1 · answered Mar 20 '18 at 18:31

The anytime package does just that -- heuristically evaluating plausible formats:

R> library(anytime)
R> x <- c("6/21/2006 0:00",
+        "1889-06-13 00:00:00",
+        "6/28/2012 0:00",
+        "5/19/2015 0:00",
+        "6/6/2016 0:00",
+        "1884-05-24 00:00:00",
+        "7/28/2013 0:00")
R> anytime(x)
[1] "2006-06-21 CDT" "1889-06-13 CST" "2012-06-28 CDT"
[4] "2015-05-19 CDT" NA               "1884-05-24 CST"
[7] "2013-07-28 CDT"
R>

It uses Boost's date_time library parser by default, and that one does not do single digit month/day, hence the NA on element six. But we also added R's parser as a fallback:

R> anytime(x, useR=TRUE)
[1] "2006-06-21 CDT" "1889-06-13 CST" "2012-06-28 CDT"
[4] "2015-05-19 CDT" "2016-06-06 CDT" "1884-05-24 CST"
[7] "2013-07-28 CDT"
R>

So here is all just works without a single format specification.

score 3 · Answer 2 · answered Jun 22 '18 at 00:11

Using lubridate::parse_date_time():

library(lubridate)
library(dplyr)

x <- c("6/21/2006 0:00",
       "1889-06-13 00:00:00",
       "6/28/2012 0:00",
       "5/19/2015 0:00",
       "6/6/2016 0:00",
       "1884-05-24 00:00:00",
       "7/28/2013 0:00")

df <- data_frame(date_column = x)
df_new <- df %>% 
  mutate(new_column = parse_date_time(date_column, orders = c('ymdHMS', "mdyHM")))

df_new
# A tibble: 7 x 2
  date_column         new_column         
  <chr>               <dttm>             
1 6/21/2006 0:00      2006-06-21 00:00:00
2 1889-06-13 00:00:00 1889-06-13 00:00:00
3 6/28/2012 0:00      2012-06-28 00:00:00
4 5/19/2015 0:00      2015-05-19 00:00:00
5 6/6/2016 0:00       2016-06-06 00:00:00
6 1884-05-24 00:00:00 1884-05-24 00:00:00
7 7/28/2013 0:00      2013-07-28 00:00:00

De Novo · Answer 3 · 2018-03-20T18:29:12.120

So we start by separating the two:

x <- c("03/20/2018 10:42", "2018-03-20 10:37:02")
DF <- data.frame(x = x, stringsAsFactors = FALSE)
slash_index <- grep("/", DF$x)
slash <- DF$x[slash_index]
dash <- DF$x[-slash_index]

Then we convert them. I like lubridate, but you can use your method if you'd like

library(lubridate)
slash <- mdy_hm(slash)
dash <- ymd_hms(dash)

Then we put them into a date vector:

date_times <- integer(0)
date_times[slash_index] <- slash
date_times[seq_along(DF$x)[-slash_index]] <- dash
DF$x <- as.POSIXct(date_times, origin = "1970-01-01 00:00:00")
DF
#                     x
# 1 2018-03-20 03:42:02
# 2 2018-03-20 03:37:02

Note:
The tricky part here was re-assigning parts of a vector to a vector according to their index. When a portion of a vector was assigned to a POSIXct object, it had its attributes stripped, turning it into the internal integer code for the date time. This was resolved by stripping the attributes at the beginning, and then re-assigning the class at the end.

Here's the full thing with your example:

install.packages("lubridate")
library(lubridate)
x <- c("6/21/2006 0:00",
       "1889-06-13 00:00:00",
       "6/28/2012 0:00",
       "5/19/2015 0:00",
       "6/6/2016 0:00",
       "1884-05-24 00:00:00",
       "7/28/2013 0:00")
DF <- data.frame(x = x, stringsAsFactors = FALSE)
slash_index <- grep("/", DF$x)
slash <- DF$x[slash_index]
dash <- DF$x[-slash_index]


slash <- mdy_hm(slash)
dash <- ymd_hms(dash)


date_times <- integer(0)
date_times[slash_index] <- slash
date_times[seq_along(DF$x)[-slash_index]] <- dash
DF$x <- as.POSIXct(date_times, origin = "1970-01-01 00:00:00", tz = "UTC")
DF
#            x
# 1 2006-06-21
# 2 1889-06-13
# 3 2012-06-28
# 4 2015-05-19
# 5 2016-06-06
# 6 1884-05-24
# 7 2013-07-28

Because the times for these are all "00:00:00", they've been truncated. You can display them with the "00:00:00" using the method described in answers to this question.

Parsing dates with different formats

3 Answers3

Linked

Related