I am trying to extract dates from text and store them in a new column of a dataset. The dates in column A1 are entered in different formats (either mm/dd/yy or mm/dd). I need a way to identify the date in column A1 and then add the year when it is missing. So far I have been able to extract the date regardless of the format; however, when I call as.Date on the new column A2, the dates in mm/dd format become <NA>. I am aware that there might not be a direct solution for this, but a workaround (generalizable to a larger dataset) would be great. The dates run from September 2019 to August 2020. Additionally, I am not sure why the format I pass to the as.Date function does not control how the date is displayed. This latter issue is not that important, but the behavior surprises me. A solution in tidyverse would be much appreciated.
library(tidyverse)
library(stringr)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db %>% mutate(A2 = str_extract(A1, "[0-9/]+"))
#                 A1      A2
# 1     review 11/18   11/18
# 2   begins 12/4/19 12/4/19
# 3           3/5/20  3/5/20
# 4             <NA>    <NA>
# 5 deadline 09/5/19 09/5/19
# 6              9/3     9/3
db %>%
  mutate(A2 = str_extract(A1, "[0-9/]+")) %>%
  mutate(A2 = as.Date(A2, format = "%m/%d/%y"))
#                 A1         A2
# 1     review 11/18       <NA>
# 2   begins 12/4/19 2019-12-04
# 3           3/5/20 2020-03-05
# 4             <NA>       <NA>
# 5 deadline 09/5/19 2019-09-05
# 6              9/3       <NA>
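One possible workaround, given only as a rough sketch: since the dates run from September 2019 to August 2020, the month alone determines the missing year (months 9 to 12 would be 2019, months 1 to 8 would be 2020). That mapping is an assumption drawn from the date range stated above, and the helper columns mon and yr as well as the A2_display column are hypothetical names used only for illustration. The sketch also shows that the format argument of as.Date describes how the input string is parsed; a Date always prints as yyyy-mm-dd, so a different display needs a separate format() call, which returns a character vector.

```r
library(dplyr)
library(stringr)

db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA,
                        "deadline 09/5/19", "9/3"))

res <- db %>%
  mutate(
    # extract mm/dd with an optional two-digit year
    A2 = str_extract(A1, "\\d{1,2}/\\d{1,2}(/\\d{2})?"),
    # month as an integer (the digits before the first slash)
    mon = as.integer(str_extract(A2, "^\\d+")),
    # ASSUMPTION: Sep-Dec belong to 2019, Jan-Aug to 2020
    yr = if_else(mon >= 9, "19", "20"),
    # append the year only when the string has a single slash (mm/dd)
    A2 = if_else(str_count(A2, "/") == 1, str_c(A2, "/", yr), A2),
    # `format` here tells as.Date how to read the string
    A2 = as.Date(A2, format = "%m/%d/%y"),
    # a character column for display in another layout
    A2_display = format(A2, "%m/%d/%Y")
  ) %>%
  select(-mon, -yr)

res
```

With the sample data this should give 2019-11-18 for "review 11/18" and 2019-09-03 for "9/3", while the row with a missing A1 stays NA. If the real data could span more than one September to August cycle, the month-to-year mapping would no longer be unambiguous and would need extra information.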