6

I have a string in R in the following form:

example <- c("namei1 namej1, surname1, name2, surnamei2 surnamej2, name3, surname3")

And I wish to obtain two columns:

namei1 namej1   | surname1
name2           | surnamei2 surnamej2
name3           | surname3

I tried splitting the string with str_split:

library(stringr)  # str_split() comes from the stringr package

example <- c("namei1 namej1, surname1, name2, surnamei2 surnamej2, name3, surname3")
pattern <- "\\,+[[:space:]]"
str_split(example, pattern)

But I am stuck from here…

M--
anespinosa

3 Answers

5
read.csv(text = gsub("([^,]+,[^,]+),", "\\1\n", example), 
         header = FALSE, stringsAsFactors = FALSE)
#              V1                   V2
# 1 namei1 namej1             surname1
# 2         name2  surnamei2 surnamej2
# 3         name3             surname3
M--
  • Could you explain the regex? – camille Sep 04 '19 at 19:08
  • @camille it looks for commas in pairs and replaces the second one with `\n`, which is a newline (`\\1` keeps everything up to the second comma, including the first comma). `read.csv` uses the comma as column delimiter, and `\n` will be interpreted as a new row. I could've used `(,[^,]*),` as well. – M-- Sep 04 '19 at 20:31
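To make the explanation in the comment above concrete, here is a minimal sketch that prints the intermediate string before it reaches `read.csv`. The variable names `intermediate` and `df` are only illustrative, and `strip.white = TRUE` is an addition (not part of the answer) to drop the leading spaces left after each comma.

```r
example <- "namei1 namej1, surname1, name2, surnamei2 surnamej2, name3, surname3"

# Each match is "field,field,"; \\1 keeps "field,field" and the
# trailing comma is replaced by a newline.
intermediate <- gsub("([^,]+,[^,]+),", "\\1\n", example)
cat(intermediate, "\n")

# read.csv() now sees three lines with two comma-separated fields each.
# strip.white = TRUE (assumption: leading spaces are not wanted) trims the fields.
df <- read.csv(text = intermediate, header = FALSE,
               stringsAsFactors = FALSE, strip.white = TRUE)
print(df)
```

Without `strip.white = TRUE` the values after a comma or newline keep their leading space, which is easy to miss in the printed data frame because character columns are right-aligned.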
4

We can split the string at `,` followed by zero or more spaces (`\\s*`), then create a grouping variable based on the occurrence of the 'name' string, split the vector (`v1`) into a `list` of vectors, `rbind` the `list` elements and convert the result to a `data.frame`

v1 <- strsplit(example, ",\\s*")[[1]]
setNames(do.call(rbind.data.frame, split(v1, cumsum(grepl('\\bname',
       v1)))), paste0("V", 1:2))
#       V1                  V2
#1 namei1 namej1            surname1
#2         name2 surnamei2 surnamej2
#3         name3            surname3
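To see how the grouping variable comes about, the intermediate values can be inspected step by step. This is only a sketch: `starts` and `group` are illustrative names, and it assumes, as in the example data, that every first-name field begins with 'name' at a word boundary while surname fields do not.

```r
example <- "namei1 namej1, surname1, name2, surnamei2 surnamej2, name3, surname3"

# Split on a comma followed by optional whitespace.
v1 <- strsplit(example, ",\\s*")[[1]]

# TRUE where a field starts a new record: "name" preceded by a word boundary.
# "surname1" does not match because "name" follows "r" there.
starts <- grepl("\\bname", v1)

# The running count of record starts gives one group id per field.
group <- cumsum(starts)

print(data.frame(field = v1, starts = starts, group = group))
```

Each group id then holds exactly one name and one surname, which is what `split` hands to `rbind.data.frame`.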

Another option is to read the fields with `scan` and arrange them in a two-column matrix

as.data.frame( matrix(trimws(scan(text = example, sep=",",
      what = "", quiet = TRUE)), byrow = TRUE, ncol = 2))
#       V1                  V2
#1 namei1 namej1            surname1
#2         name2 surnamei2 surnamej2
#3         name3            surname3

Or another option is `gsub`, where we replace the `,` followed by a space and the 'name' string with `\n` and 'name', and pass the result to `read.csv`, which splits on the delimiter `,`

read.csv(text = gsub(", name", "\nname", example), header= FALSE)
#         V1                   V2
#1 namei1 namej1             surname1
#2         name2  surnamei2 surnamej2
#3         name3             surname3
akrun
  • What happens if you skip `trimws`? – M-- Sep 04 '19 at 20:45
  • @M- the issue is that the `sep` argument of `scan` takes a single character. I could include `strip.white = TRUE` and avoid the `trimws`: `scan(text = example, sep=",", what = "", quiet = TRUE, strip.white = TRUE)`. I thought using `trimws` would make it more compact. Nothing else – akrun Sep 04 '19 at 20:48
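The `strip.white = TRUE` variant mentioned in the comment above could look roughly like this. It is a sketch rather than the answerer's exact code; `fields` and `res` are illustrative names.

```r
example <- "namei1 namej1, surname1, name2, surnamei2 surnamej2, name3, surname3"

# sep must be a single character, so the space after each comma stays
# attached to the field; strip.white = TRUE removes it while reading.
fields <- scan(text = example, sep = ",", what = "",
               quiet = TRUE, strip.white = TRUE)

# Fill a two-column matrix row by row: name, surname, name, surname, ...
res <- as.data.frame(matrix(fields, byrow = TRUE, ncol = 2),
                     stringsAsFactors = FALSE)
print(res)
```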
3
data.frame(split(unlist(strsplit(example, ", ")), c(0, 1)))
#             X0                  X1
#1 namei1 namej1            surname1
#2         name2 surnamei2 surnamej2
#3         name3            surname3
d.b
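The last answer comes without an explanation; as far as I can tell it relies on `split` recycling the grouping vector `c(0, 1)` along the fields, so that odd positions fall into group "0" and even positions into group "1". A minimal sketch of that reading, with `parts`, `groups`, `first` and `last` as illustrative names that are not in the original answer:

```r
example <- "namei1 namej1, surname1, name2, surnamei2 surnamej2, name3, surname3"

parts <- unlist(strsplit(example, ", "))

# c(0, 1) is recycled to 0 1 0 1 0 1, alternating the fields between two groups.
groups <- split(parts, c(0, 1))

# Group "0" holds the names, group "1" the surnames.
res <- data.frame(first = groups[["0"]], last = groups[["1"]],
                  stringsAsFactors = FALSE)
print(res)
```

This only works when the fields strictly alternate between name and surname, which is the case in the example string.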