Split specific strings in a vector using regex

Question

I have a vector of strings, some of which include punctuation or symbols. For example:

words <- c("hi", "my.", "name!", "is98", '"joe"')

My goal is to create a vector that contains all these words, but with each run of punctuation, numbers, or symbols split out into its own element. So in this case:

c("hi", "my", ".", "name", "!", "is", "98", "\"", "joe", "\"")

My initial plan was to use grep to identify the indices where said punctuations exist, then loop through them and use strsplit to divide them based on said punctuations, as follows:

puncIndex <- grep('[\\"!?.^]', words)
for(i in puncIndex){
  strsplit(words[i], '[\\"!?.^]')
}

But I'm having a couple of problems. First, strsplit returns a list, and I can't figure out how to cleanly move each of its components back into the original vector. Second, even when I try strsplit on just one word, it only returns the first part. For example:

strsplit(words[2], ".")
[[1]]
[1] "my"

EDIT: added numbers as a class to be separated as well

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

You may try

  res <- unlist(strsplit(words, '(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)',
                   perl=TRUE))
  res
  #[1] "hi"   "my"   "."    "name" "!"    "is"   "\""   "joe"  "\""

Or using str_extract_all

 library(stringr)
 unlist(str_extract_all(words, '\\w+|\\W+'))
 #[1] "hi"   "my"   "."    "name" "!"    "is"   "\""   "joe"  "\""

EDIT: Added @Avinash Raj's suggestion

data

 words <- c("hi", "my.", "name!", "is", '"joe"')
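
As a side note on the two problems raised in the question, here is a minimal sketch (not part of the original answer; the variable names are illustrative). It shows that strsplit consumes whatever the pattern matches, that an unescaped "." is a regex metacharacter and has to be treated as a literal to split on a dot, and that unlist flattens the list strsplit returns.

```r
words <- c("hi", "my.", "name!", "is", '"joe"')

# An unescaped "." is a regex that matches any character. With fixed = TRUE
# (or the escaped pattern "\\.") the split happens on a literal dot, but the
# dot itself is consumed, which is why only "my" comes back.
literal_split <- strsplit(words[2], ".", fixed = TRUE)[[1]]
print(literal_split)

# strsplit is vectorised: it returns a list with one element per input string.
# Zero-width lookarounds split between characters without consuming any.
pieces <- strsplit(words, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE)
print(lengths(pieces))

# unlist() flattens the list into one character vector, so no loop is needed.
flat <- unlist(pieces)
print(flat)
```

Because the split is applied to the whole vector at once, the grep step and the for loop from the question are not required.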

answered Jul 09 '15 at 06:52

akrun

Thank you, the first solution worked perfect. One more question (apologies for forgetting to ask first): how would I alter it to separate numbers as well? – NeonBlueHair Jul 10 '15 at 01:21
@NeonBlueHair Please update your post with a new example that has numbers as well – akrun Jul 10 '15 at 06:17

Avinash Raj · Accepted Answer · 2015-07-10T07:02:47.207

2

Just split on the word boundaries that fall inside each string.

words <- c("hi", "my.", "name!", "is", '"joe"')
unlist(strsplit(words, '(?<=.)\\b(?=.)', perl=TRUE))
#[1] "hi"   "my"   "."    "name" "!"    "is"   "\""   "joe" 
#[9] "\""

The trick here is \\b, the word boundary, which matches between a word character and a non-word character (in either order). On its own, \\b would also match at the start and end of a string whenever the first or last character is a word character. The lookaround assertions (?<=.) and (?=.) rule those positions out by requiring at least one character on each side of the boundary.
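
To illustrate the explanation above, here is a small sketch (variable names are illustrative) that runs the guarded \\b pattern next to the explicit \\w/\\W lookaround pattern from the other answer and checks that they agree on the sample data. It also shows that digits count as word characters, so neither pattern splits "is98".

```r
words <- c("hi", "my.", "name!", "is", '"joe"')

# Word boundary, guarded so it cannot match at the very start or end.
boundary_split <- unlist(strsplit(words, "(?<=.)\\b(?=.)", perl = TRUE))

# The same idea spelled out: a word char followed by a non-word char,
# or a non-word char followed by a word char.
lookaround_split <- unlist(
  strsplit(words, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE)
)

print(boundary_split)
print(identical(boundary_split, lookaround_split))

# \w covers letters, digits and underscore, so there is no boundary
# between "s" and "9" and the string comes back in one piece.
digits_kept <- unlist(strsplit("is98", "(?<=.)\\b(?=.)", perl = TRUE))
print(digits_kept)
```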

Update:

library(stringr)
unlist(str_extract_all(words, '[A-Za-z]+|[^A-Za-z]+'))
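
The pattern above puts digits in the same class as punctuation, so an input such as "98!" would stay in one piece. A hedged base R sketch, assuming letters, digits, and everything else should each form their own class (the three-class pattern, the extra "98!" test string, and the variable names are illustrative and not from the original answer):

```r
words <- c("hi", "my.", "name!", "is98", '"joe"', "98!")

# Three alternatives: a run of letters, a run of digits,
# or a run of anything that is neither.
pattern <- "[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+"

# gregexpr finds every match in each string; regmatches extracts the
# matched text as a list, which unlist() flattens into one vector.
tokens <- unlist(regmatches(words, gregexpr(pattern, words)))
print(tokens)
```

This extracts the pieces instead of splitting between them, so it needs neither lookarounds nor the stringr package.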

edited Jul 10 '15 at 07:02

answered Jul 09 '15 at 07:09

Avinash Raj

Thanks, this worked as well! Is there a performance difference between using \b and using \w like the other proposed answer? – NeonBlueHair Jul 10 '15 at 01:22
Sorry, I should have phrased my question better. I meant more what the difference in output would be, given that I tried both and they both gave the same output. But reading your explanation again, I think I understand it. Thanks! – NeonBlueHair Jul 10 '15 at 06:58
