0

I have a large data set of names and states that I need to split. After splitting, I want to create new rows with each name and state. My data strings are in multiple lines that look like this

"Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ"
"Ralph Hogan, TX, Michael Johnson, FL"

I need the data to look like this

attr      name            state
1         Peter Johnson   IN
2         Chet Charles    TX
3         Ed Walsh        AZ
4         Ralph Hogan     TX
5         Michael Johnson FL

I can't figure out how to do this, perhaps split it somehow a few characters after the comma? Any help would be greatly appreciated.

Kavita
  • 11
  • 2

1 Answers1

-1

If it is multiple line strings, then we can create a delimiter with gsub, split the strings using strsplit, create data.frame with the components of the split in the output list, and rbind it together.

d1 <- do.call(rbind, lapply(strsplit(gsub("([A-Z]{2})(\\s+|,)",
                    "\\1;", lines), "[,;]"), function(x) {
                        x1 <- trimws(x)
       data.frame(name = x1[c(TRUE, FALSE)],state = x1[c(FALSE, TRUE)]) }))     
cbind(attr = seq_len(nrow(d1)), d1)
#  attr            name state
#1    1   Peter Johnson    IN
#2    2    Chet Charles    TX
#3    3        Ed Walsh    AZ
#4    4     Ralph Hogan    TX
#5    5 Michael Johnson    FL

Or this can be done in a compact way

library(data.table)
fread(paste(gsub("([A-Z]{2})(\\s+|,)", "\\1\n", lines), collapse="\n"),
        col.names = c("names", "state"), header = FALSE)[, attr := 1:.N][]
#             names state attr
#1:   Peter Johnson    IN    1
#2:    Chet Charles    TX    2
#3:        Ed Walsh    AZ    3
#4:     Ralph Hogan    TX    4
#5: Michael Johnson    FL    5

data

lines <- readLines(textConnection("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ
 Ralph Hogan, TX, Michael Johnson, FL"))
akrun
  • 874,273
  • 37
  • 540
  • 662