Extract characters up to "/" using R

Question

I am trying to extract characters before and after the "/" character using R.

For example, I can get the tags with the following:

s <- "hello/JJ world/NN"

# get the tags
sapply(s, function(x){gsub("([a-z].*?)/([A-z].*?)", "\\2", x)})

which returns

"JJ NN"

However, when I try to extract the characters before the "/" or the "tokens", using the following:

sapply(s, function(x){gsub("([a-z].*?)/([A-z].*?)", "\\1", x)})

I get

"helloJ worldN"

How can I get "hello world" and why is the first letter of the tag slipping in there?

you're using sapply on a vector of length one. why not just `gsub('/[a-z]+', '', s, ignore.case = TRUE)` and `gsub('[a-z]+/', '', s, ignore.case = TRUE)` ? — rawr, Aug 02 '15 at 22:33

score 3 · Accepted Answer · edited May 23 '17 at 12:06

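I think the reason you get those letters remaining in the output is your regex. The [A-z] class (it should be [A-Z]; I guess the z is a typo, see "[A-Za-z] Shorthand class?") is fine as far as it goes, but it is followed by a lazy .*? that matches zero or more characters, as few as possible. Since nothing follows it in the pattern, it matches nothing at all. So the second group captures only the first letter of the tag, the match ends there, and the rest of the tag stays in the string: "hello/J" is replaced with "hello" and the second "J" is left behind. The tag extraction only appears to work for the same reason: "hello/J" is replaced with "J", which joins the leftover "J" to give "JJ".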
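To make this visible, here is a small sketch that lists the substrings the original pattern actually matches. It assumes perl = TRUE, which is added here only to pin the behaviour to the PCRE engine (the question used R's default engine and reported the same leftover letters); the names pattern and matches are illustrative.

```r
s <- "hello/JJ world/NN"
pattern <- "([a-z].*?)/([A-z].*?)"

# Locate every match of the original pattern, then pull the matched text out.
# perl = TRUE is an assumption of this sketch, not part of the question's call.
matches <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]

print(matches)  # "hello/J" "world/N"
```

Each match stops after a single tag letter, so whatever gsub substitutes, the second letter of every tag is left in place.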

You need a + quantifier, applied to the character class [a-zA-Z], so that the whole tag (one or more letters) is matched:

s <- "hello/JJ world/NN"
sapply(s, function(x){gsub("([a-zA-Z])/[a-zA-Z]+", "\\1", x)})

See demo

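I removed the second group since you are not using it. The remaining group captures only the letter right before the slash, which is enough: the rest of the token is never part of the match, so it is left untouched.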
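For completeness, a minimal sketch that gets both the tokens and the tags with the same idea. It assumes tokens and tags consist of ASCII letters only, and it calls gsub directly because gsub is already vectorized over its input; the variable names tokens and tags are illustrative.

```r
s <- "hello/JJ world/NN"

# Tokens: match "token/TAG" and keep only the captured token.
tokens <- gsub("([a-zA-Z]+)/[a-zA-Z]+", "\\1", s)

# Tags: match "token/TAG" and keep only the captured tag.
tags <- gsub("[a-zA-Z]+/([a-zA-Z]+)", "\\1", s)

print(tokens)  # "hello world"
print(tags)    # "JJ NN"
```

If the tokens can contain digits or punctuation, the character classes would need to be widened accordingly.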

answered Aug 02 '15 at 22:39 by Wiktor Stribiżew · edited May 23 '17 at 12:06 by Community

or you could just copy/paste the output below the code? – rawr Aug 02 '15 at 22:48
