1

I am importing data from a csv file where the date column contains dates recorded in different formats. I wish to parse the column so that it has the class date and such that all of the dates are formatted in the same style (i.e %d-%m-%Y). I wish to use lubridate as I have some experience with it and want to get better using it.

I have looked for answers here Parsing dates with different formats and here Parsing dates in multiple formats in R using lubridate but I found the answers incomplete.

Typically when I import csv data I change the col_types like so:

potatoes <- read_csv("data/potato_prices.csv",
           col_types = cols(
           DATE = col_date(format = "%Y-%m-%d"), 
           'M04003DE00BERM372NNBR' = col_double())) %>% 
           rename("Price" = "M04003DE00BERM372NNBR")

but because my DATE column contains dates in different formats, dates not formatted like "%Y-%m-%d" return an NA and the class of the column appears as unknown.

I have tried col_guess, instead of specifying with col_date with the exact date formats and then mutating the DATE column with the following code, but it has not worked as I would like.

potatoes <- read_csv("data/potato_prices.csv",
                      col_types = cols(
                      DATE = col_guess(),
                      'M04003DE00BERM372NNBR' = col_double())) 

potatoes <- potatoes %>% 
  mutate(DATE = parse_date_time(DATE, orders = c("Ymd", "dmY"))) %>%
  rename("Price" = "M04003DE00BERM372NNBR")

Here is an example of how my data appears in excel in csv format

DATE <- c("1879-01-01", "1879-02-01", "1879-03-01", "1879-04-01", "1/05/1990", "1/06/1990", "1/07/1990", "1/08/1990", "1/09/1990", "1/10/1990")
Price <- c("23", "17.9", "17.8", "18", "20", "22", "20", "19", "17.2", "15")

spuds <- data.frame(DATE, Price)

I wish to have a tibble with two columns; DATE as class col_date and Price as class col_double. I will then create plots using ggplot and I think it will be easiest if my DATE column is in class date.

Thanks

cromj006
  • 35
  • 3

1 Answers1

0

The following function will try the several date formats passed in its argument format. It uses lubridate function guess_formats in order to get the possible formats based on that argument.

as_Date <- function(x, format = c("ymd", "dmy", "mdy")){
  fmt <- lubridate::guess_formats(x, format)
  fmt <- unique(fmt)
  y <- as.Date(x, format = fmt[1])
  for(i in seq_along(fmt)[-1]){
    na <- is.na(y)
    if(!any(na)) break
    y[na] <- as.Date(x[na], format = fmt[i])
  }
  y
}

formats <- c("ymd", "dmy")
as_Date(spuds$DATE, formats)
#[1] "1879-01-01" "1879-02-01" "1879-03-01" "1879-04-01"
#[5] "1990-05-01" "1990-06-01" "1990-07-01" "1990-08-01"
#[9] "1990-09-01" "1990-10-01"
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Sorry for the delay in responding - I've only just had time to look at this again. Your code worked but got me to a similar position where the DATE is classed as ``````. I am trying to get it classed as `````` so that when I use ```ggplot``` I can use code such as ```scale_x_date()```. I was hoping there might be an easier way I can parse the DATE column when I read in the CSV file? Thanks for your help – cromj006 Jan 06 '21 at 22:05
  • @cromj006 Class `dttm` is a datetime class, see, for instance, [this SO post](https://stackoverflow.com/questions/58528619/converting-dttm-to-date-formatting-with-as-date-and-as-date-give-different-r). – Rui Barradas Jan 06 '21 at 22:42
  • Thanks for clarifying that and for your help – cromj006 Jan 07 '21 at 10:16