2

I am currently working on a project where I am trying to calculate various data, however, the CSV file I am working with has an anomaly in the column. It contains a date in the format "%d/%m/%y" format followed immediately by a string.

This is repeated throughout the entire column (column is headerless just in case) and what I am currently trying to achieve is replace the date throughout the column with leaving the remaining string only.

My current approach is to use gsub function, which looks as follows:

gsub(".[/]|[/]|[[:digit:]].", " ", dataset column)

This seems to work initially, however when running a head command, it appears to apply this only for the first 6-7 fields and the rest are appearing as NA values.

Is there any limitations to the GSub function if I am working with a column of 3000+ entries or is there something wrong with the logic behind the code to achieve this.

Here is the sample data used for the code:

structure(list(V1 = c("3/3/2005Mitsubishi", "3/4/2006Jaguar", 
"13/2/2007Land Rover", "12/12/2009Ferrari", "4/4/2008Jeep", "3/3/2005Honda"
), V2 = c("Mitsubish", "Jaguar", "Land Rover", "Ferrari", "Jeep", 
"Honda")), row.names = c(NA, 6L), class = "data.frame")
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    Please share a reproducible sample of your data with `dput(head(data))`. – Anoushiravan R Jul 18 '21 at 14:55
  • 1
    Thanks for quick response. Here is the output of the data with ```dput(head(data))``` ```structure(list(V1 = c("3/3/2005Mitsubishi", "3/4/2006Jaguar", "13/2/2007Land Rover", "12/12/2009Ferrari", "4/4/2008Jeep", "3/3/2005Honda" ), V2 = c("Mitsubish", "Jaguar", "Land Rover", "Ferrari", "Jeep", "Honda")), row.names = c(NA, 6L), class = "data.frame")``` – CodingSamaritan Jul 18 '21 at 15:06

2 Answers2

3

In case you would like to keep all three columns. Just bear in mind that we defined two captured groups, one to capture date and the other to capture the rest. So in regex argument of tidyr::extract we defind how are strings are separated based on our defined captured groups and sit into as many columns:

library(tidyr)

df %>%
  extract(V1, c("V1", "V3"), "(\\d+\\/\\d+\\/\\d+)(.*)")

          V1         V3         V2
1   3/3/2005 Mitsubishi  Mitsubish
2   3/4/2006     Jaguar     Jaguar
3  13/2/2007 Land Rover Land Rover
4 12/12/2009    Ferrari    Ferrari
5   4/4/2008       Jeep       Jeep
6   3/3/2005      Honda      Honda

In case there are multiple columns you would like to separate, one way is to write a custom function and apply in on every column with data-string combination:

library(rlang)

fn <- function(...) {
  dots <- dots_list(...)
  extract(..., into = sprintf("%s_col_%d", dots[[2]], 1:2), regex = "(\\d+\\/\\d+\\/\\d+)(.*)")
}

# Now imagine we have a data set called df2 which has two columns we 
# would like to separate

df %>%
  mutate(V3 = V1) %>%
  select(V1, V3) -> df2

                   V1                  V3
1  3/3/2005Mitsubishi  3/3/2005Mitsubishi
2      3/4/2006Jaguar      3/4/2006Jaguar
3 13/2/2007Land Rover 13/2/2007Land Rover
4   12/12/2009Ferrari   12/12/2009Ferrari
5        4/4/2008Jeep        4/4/2008Jeep
6       3/3/2005Honda       3/3/2005Honda

Then we apply it on our data set choosing only columns that we would like to be separated:

# Instead of `names(df2)` you can choose a character vector containing the
# name of the columns you would like to separate

library(purrr)
names(df2) %>%
  reduce(~ fn(.x, .y), .init = df2)

    V1_col_1   V1_col_2   V3_col_1   V3_col_2
1   3/3/2005 Mitsubishi   3/3/2005 Mitsubishi
2   3/4/2006     Jaguar   3/4/2006     Jaguar
3  13/2/2007 Land Rover  13/2/2007 Land Rover
4 12/12/2009    Ferrari 12/12/2009    Ferrari
5   4/4/2008       Jeep   4/4/2008       Jeep
6   3/3/2005      Honda   3/3/2005      Honda

I found this post very helpful for this solution.

Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    Thanks! That's a handy method. Is there any way that you know of to essentially clean the data and remove the date entirely from the dataset? In your results above, the date is now being stored as a standalone column, however, the aim is to remove the date entirely and just have eg. Mitsubishi, Jaguar etc. Thanks in advance – CodingSamaritan Jul 18 '21 at 15:29
  • I suspected that first, but thought in that case we would end up with 2 entirely identical variables. So you would like to only remove dates and keep the rest? – Anoushiravan R Jul 18 '21 at 15:30
  • 1
    Precisely. End goal is to clean the first column so only thing that remains is the string part and removing date in its entirety. Apologies for the confusion. – CodingSamaritan Jul 18 '21 at 15:32
  • 1
    Awesome. that did the right trick for removing the date. And I assume in the scenario that you want to keep all 3, you would need to amend the mutate code to also capture and display the date, correct? Thinking something along the lines of ```df %>% mutate(V1 = gsub("(\\d+\\/\\d+\\/\\d+)(.*)", "\\1,"\\2", V1))``` – CodingSamaritan Jul 18 '21 at 15:44
  • I have added another solution in case you wanted to keep all three columns. Mr. @jay.sf 's solution in also another great alternative. – Anoushiravan R Jul 18 '21 at 15:47
  • 1
    Thanks both @jay.sf. Both of these solutions were really helpful and yielded the correct results, so thanks again. One question on this - can this method be extrapolated to a larger dataset (more columns) or will this only work on this specific number of columns (i.e 2) Reason for asking is that I have now added additional columns with strings (circa 30 extra columns with single strings) and running the above renders NA values where previously they appeared correctly. – CodingSamaritan Jul 18 '21 at 16:06
  • I have presented a new solution for that purpose. – Anoushiravan R Jul 18 '21 at 16:58
1

strsplit at a (emulated) lookbehind.

r <- data.frame(do.call(rbind, 
                        strsplit(dat[, 1], '\\d{4}\\K', perl=TRUE)), dat[-1])
r[, 1] <- as.Date(r[, 1], '%D')  ## transform the date
r
#           X1         X2         V2
# 1 2003-03-20 Mitsubishi  Mitsubish
# 2 2003-04-20     Jaguar     Jaguar
# 3 2013-02-20 Land Rover Land Rover
# 4 2012-12-20    Ferrari    Ferrari
# 5 2004-04-20       Jeep       Jeep
# 6 2003-03-20      Honda      Honda

This answer explains the \K well.

Edit

Or leave out the "transform the date" step if you don't need the date, and do r <- r[-1].


Data:

dat <- structure(list(V1 = c("3/3/2005Mitsubishi", "3/4/2006Jaguar", 
"13/2/2007Land Rover", "12/12/2009Ferrari", "4/4/2008Jeep", "3/3/2005Honda"
), V2 = c("Mitsubish", "Jaguar", "Land Rover", "Ferrari", "Jeep", 
"Honda")), row.names = c(NA, 6L), class = "data.frame")
jay.sf
  • 60,139
  • 8
  • 53
  • 110