
I want to extract the country from a vector of strings like 'M_South_Africa_5_14' and 'P_Zimbabwe_Tot'. I have been trying, unsuccessfully, to do it with a single grep or stringr::str_extract call. Of course, I could split the string on '_' and then collect the pieces, but is it possible to do it with a regular expression?

grep(value = TRUE, 
     x = 'M_South_Africa_5_14', 
     pattern = '(?!^[PMF]{1})(?![_])([A-Za-z]{2,20})[_][A-Za-z]{2,20}(?!$)|(?!^[PMF]{1})(?![_])([A-Za-z]{2,20})', 
     perl = TRUE)

It would be great to simplify this regex monster, of course, but what I actually want to know is whether I can use regex lookups (lookaround assertions) in R.

dmvianna
  • It doesn't seem like there is regularity in the strings you need to parse .. how do you know if the country is only one word or not? Does the country always start at the second word? – Explosion Pills Feb 20 '13 at 23:48
  • Is the pattern `M/F_Country_Name_Agegrpstart_Agegrpend`? – thelatemail Feb 20 '13 at 23:51
  • Yes, there is regularity. First letter is M for Male, F for Female and P for Population total (M+F); then country names (one or two words); then either an age bracket or 'Tot' for total. My regex seems to work neatly at [this engine](http://www.regex101.com/), but not in R. – dmvianna Feb 20 '13 at 23:54
  • Does `gsub('[FMP]_([A-z_]+)_[0-9|T]+.*', '\\1', x)` work? It does for your two examples... but a country with a second word beginning in a capital `T` will fail... – Justin Feb 21 '13 at 00:01
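The capture-group idea in the last comment can probably be tightened so that a second word starting with `T` no longer breaks it. The following is only a sketch under the format described above (one leading letter, a one- or two-word country, then an age bracket or a trailing `Tot`); the `P_East_Timor_Tot` input is a made-up test case, not from the original data. Note that `[A-z]` also covers the punctuation between `Z` and `a` in ASCII, which is why it happens to match `_`; spelling out `[A-Za-z_]` is more explicit.

```r
# 'P_East_Timor_Tot' is a hypothetical example added to exercise the "T" case.
x <- c("M_South_Africa_5_14", "P_Zimbabwe_Tot", "P_East_Timor_Tot")

# ^[FMP]_          leading sex/population code
# ([A-Za-z_]+?)    country, captured lazily (as short as possible)
# _(?:[0-9].*|Tot)$  an age bracket starting with a digit, or a trailing "Tot"
pat <- "^[FMP]_([A-Za-z_]+?)_(?:[0-9].*|Tot)$"

# sub() replaces the whole (anchored) match with the captured country.
country <- sub(pat, "\\1", x, perl = TRUE)
country
```

Strings that do not fit the assumed format are returned unchanged by `sub()`, so it may be worth checking them with `grepl(pat, x, perl = TRUE)` first.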

1 Answer


This works on your example:

> library(gsubfn)
> x <- c('M_South_Africa_5_14', 'P_Zimbabwe_Tot')
> pat <- "_(.*\\D)_"
> strapplyc(x, pat)
[[1]]
[1] "South_Africa"

[[2]]
[1] "Zimbabwe"
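
On the lookaround part of the question: base R does accept lookahead and lookbehind when `perl = TRUE` is set. One likely reason the original attempt looked like a failure is that `grep(value = TRUE)` returns the whole matching element rather than the matched substring. Below is a minimal sketch using `regmatches()` with `regexpr()`; the pattern is my own simplification and assumes the format described in the comments (a leading `P`, `M` or `F`, then the country, then an age bracket or `Tot`).

```r
x <- c("M_South_Africa_5_14", "P_Zimbabwe_Tot")

# (?<=^[PMF]_)        lookbehind: match starts right after the leading code
# [A-Za-z_]+?         letters and underscores, as few as possible (lazy)
# (?=_(?:\\d|Tot$))   lookahead: followed by "_<digit>" or a trailing "_Tot"
pat <- "(?<=^[PMF]_)[A-Za-z_]+?(?=_(?:\\d|Tot$))"

# regexpr() locates the first match in each element (perl = TRUE enables
# lookarounds); regmatches() extracts the matched substrings.
country <- regmatches(x, regexpr(pat, x, perl = TRUE))
country
# [1] "South_Africa" "Zimbabwe"
```

One caveat: `regmatches()` silently drops elements that have no match, so the result can be shorter than `x` if some strings do not follow the assumed format.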
G. Grothendieck
  • This is *really* neat, but I am still curious as to whether one can do lookups in R. – dmvianna Feb 21 '13 at 00:09
  • @dmvianna, [**yes you can**](http://stackoverflow.com/questions/14529473/remove-last-occurrence-of-character) – Arun Feb 21 '13 at 00:24
  • @dmvianna, This returns indexes of the components matching the `pat` pattern: ` y <- c("xyz_def", "xy_9_a", x); grep(pat, y)` where `pat` and `x` are as in the post. – G. Grothendieck Feb 21 '13 at 00:32