0

This feels like a fairly difficult data manipulation / dataframe fixup issue in R. We have the following messy dataframe, currently organized such that multiple columns of information are packed into the X2 column. Using fake names, emails, phone numbers in the example below:

coach_info <- structure(list(X1 = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_), X2 = c("TBA\r\n Head Coach", "Bobby Flowes\r\n Associate Head Women's Basketball Coach", 
"Jimmy Jimm\r\n Assistant Women's Basketball Coach", "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu", 
NA, "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu", 
"A.B. Better\r\n Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associateabbetter@gmail.edu\r\n   111-222-3333", 
"Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234", 
"Name Lasttt\r\n Assistant Coach")), row.names = c(1L, 2L, 3L, 
7L, 12L, 16L, 17L, 25L, 29L), class = "data.frame")

head(coach_info, 4)
    X1                                                                                   X2
1 <NA>                                                                   TBA\r\n Head Coach
2 <NA>                             Bobby Flowes\r\n Associate Head Women's Basketball Coach
3 <NA>                                    Jimmy Jimm\r\n Assistant Women's Basketball Coach
7 <NA> Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu

We're trying to split the X2 column info into 4 columns for Name, Title, Email and Phone. When we strsplit(coach_info$X2, '\r\n'), what we get is a messy nested list, and the splitting using \r\n is imperfect as the \r\n are missing in some of the rows:

enter image description here enter image description here enter image description here

In addition to this, the inner-nested lists each have differing numbers of elements, as many rows are missing 1 or more of the name, phone number or email address:

> unlist(lapply(strsplit(coach_info$X2, '\r\n'), length))
 [1] 2 2 2 3 1 2 3 4 2

Our goal is something as close as possible to this:

output_df <- data.frame(
    Name = c('TBA', 'Bobby Flowes', 'Jimmy Jimm', 'Rod Barber', NA, 'Gaben Spar', 'A.B. Better', 'Nick Romanov', 'Name Lasttt'),
    Title = c('Head Coach', "Associate Head Women's Basketball Coach", "Assistant Women's Basketball Coach", "Head Men's Basketball Coach",
              NA, " Men's Basketball Graduate Assistant", "Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associate",
              "Head Crew Coach", "Assistant Coach"),
    Email = c(NA, NA, NA, "Tom.Tommy@abc.edu", NA, "Coachgabensspar@gmail.edu", "abbetter@gmail.edu", "nick.nick@school.edu", NA),
    Phone = c(NA, NA, NA, "(123) 456-7890", NA, NA, "111-222-3333", "123-123-1234", NA),
    stringsAsFactors = FALSE
  )
  

>   head(output_df, 4)
          Name                                   Title             Email          Phone
1          TBA                              Head Coach              <NA>           <NA>
2 Bobby Flowes Associate Head Women's Basketball Coach              <NA>           <NA>
3   Jimmy Jimm      Assistant Women's Basketball Coach              <NA>           <NA>
4   Rod Barber             Head Men's Basketball Coach Tom.Tommy@abc.edu (123) 456-7890

It seems like it may be impossible to cleanly split up the strings where no spaces or \r\n exist between different fields, like in the screenshots above. We're just trying to get as close as possible at this point...

Canovice
  • 9,012
  • 22
  • 93
  • 211

1 Answers1

2

how about something like this

require(data.table)
setDT(coach_info)

re.phone <- '.*(\\d{3}[^[:alnum:]]*\\d{3}[^[:alnum:]]*\\d{4}).*'
re.email <- ".*[^_[:alnum:]\\-\\.]([_[:alnum:]\\-\\.]+@[[:alnum:]\\.]+).*"
re.text1 <- '([[:alnum:][:blank:]]+)\r\n([[:alnum:][:blank:][:punct:]]+).*'


coach_info[,processed:=X2]

coach_info[grepl(re.phone,X2), phone:=gsub(re.phone,'\\1',X2)]
coach_info[!is.na(phone), processed:=gsub(phone,' ',X2,fixed=T),by=phone]

coach_info[grepl(re.email,processed), email:=gsub(re.email,'\\1',processed)]
coach_info[!is.na(email), processed:=gsub(email,' ',processed,fixed=T),by=email]

coach_info[, Name:=gsub(re.text1,'\\1',processed)]
coach_info[, Title:=gsub(re.text1,'\\2',processed)]
Vasily A
  • 8,256
  • 10
  • 42
  • 76
  • Yes this gets us 85 - 90% of the way there which is helpful. I'm not as familiar with DT as opposed to the tidyverse, but its still not too tough to follow this. – Canovice Jul 19 '21 at 06:46
  • if there are any records that are not processed correctly with this code, feel free to add them to your example, and I or someone else will try to help – Vasily A Jul 19 '21 at 07:43