
Suppose I have a vector:

xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")

I want to get a new vector in which each string is split only once, at the rightmost separator. The expected result would be:

c("zz_ZZ_uu", "d", "II_OO", "d").

It would be like Python's rsplit() function. My current idea is to reverse the string and then split it with str_split() from stringr.

Any better solutions?

update
Here is my solution, which returns n pieces and depends on stringr and stringi. It would be nice if someone could provide a version that uses only base functions.

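rsplit <- function(x, s, n) {
  # Reverse each string and split from the left into at most n pieces
  pieces <- stringr::str_split(stringi::stri_reverse(x), s, n)
  # Undo the reversal per input string so the input order is preserved
  unlist(lapply(pieces, function(p) rev(stringi::stri_reverse(p))))
}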
ccshao

5 Answers


Negative lookahead:

unlist(strsplit(xx.1, "_(?!.*_)", perl = TRUE))
# [1] "zz_ZZ_uu" "d"        "II_OO"    "d"     

Here a(?!b) matches an a that is not followed by b. In this case b is .*_, so a _ is matched only if no further _ occurs anywhere later in the string (.* allows any distance).
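As a quick sanity check, the sketch below compares the position matched by the negative lookahead with the position of the last underscore found by a plain fixed search. The variable names pos_neg and pos_last are only illustrative.

```r
xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")

# Position of the underscore selected by the negative lookahead
pos_neg <- as.integer(regexpr("_(?!.*_)", xx.1, perl = TRUE))

# Position of the last underscore, found without any lookahead
pos_last <- sapply(gregexpr("_", xx.1, fixed = TRUE), function(m) max(m))

pos_neg   # 9 6
pos_last  # 9 6
```

Both approaches point at the same character, which is why strsplit() with this pattern splits only at the rightmost _.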

However, this idea does not seem easy to generalise. First, note that it can be rewritten as a positive lookahead, _(?=[^_]*$): find a _ that is followed only by non-_ characters up to the end of the string ($). A not very elegant generalisation would then be

rsplit <- function(x, s, n) {
  p <- paste0("[^", s, "]*")
  rx <- paste0(s, "(?=", paste(rep(paste0(p, s), n - 1), collapse = ""), p, "$)")
  unlist(strsplit(x, rx, perl = TRUE))
}

vec <- c("a_b_c_d_e_f_g", "a_b")
rsplit(vec, "_", 1)
# [1] "a_b_c_d_e_f" "g"           "a"           "b"          
rsplit(vec, "_", 3)
# [1] "a_b_c_d" "e_f_g"   "a_b"    

where, e.g., for n = 3 the function uses _(?=[^_]*_[^_]*_[^_]*$), so each string is split once, at the n-th _ counting from the right.
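To inspect the generated pattern directly, here is a minimal sketch that isolates the regex-building step. The helper name rsplit_pattern is hypothetical, and vec is assumed to be the input implied by the outputs above. Since s is placed inside a character class, this only works as intended when s is a single character with no special meaning in a class.

```r
# Hypothetical helper: builds only the regex that rsplit() hands to strsplit()
rsplit_pattern <- function(s, n) {
  p <- paste0("[^", s, "]*")  # a run of non-separator characters
  paste0(s, "(?=", paste(rep(paste0(p, s), n - 1), collapse = ""), p, "$)")
}

rsplit_pattern("_", 1)  # "_(?=[^_]*$)"
rsplit_pattern("_", 3)  # "_(?=[^_]*_[^_]*_[^_]*$)"

# Assumed input, inferred from the outputs shown above
vec <- c("a_b_c_d_e_f_g", "a_b")
unlist(strsplit(vec, rsplit_pattern("_", 3), perl = TRUE))
```

A string with fewer than n separators, such as "a_b" with n = 3, contains no match and is returned unsplit.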

Julius Vainora
    I'm not familiar with perl, could you explain it a little bit, and how should I change it if I want to split two or more "_"? – ccshao Dec 08 '13 at 16:09

Another two. In both I use "(.*)_(.*)" as the pattern to capture both parts of the string. Remember that * is greedy so the first (.*) will match as many characters as it can.
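To illustrate the greediness, this small sketch contrasts the greedy pattern with a lazy variant (.*?), which is not part of the answer and is shown only for comparison. The variable names are illustrative.

```r
x <- "zz_ZZ_uu_d"

# Greedy: the first group runs up to the LAST underscore
greedy_head <- sub("(.*)_(.*)", "\\1", x, perl = TRUE)

# Lazy (for comparison only): the first group stops at the FIRST underscore
lazy_head <- sub("(.*?)_(.*)", "\\1", x, perl = TRUE)

greedy_head  # "zz_ZZ_uu"
lazy_head    # "zz"
```

The greedy form is the one that gives a split from the right.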

Here I use regexec to capture where your substrings start and end, and regmatches to reconstruct them:

unlist(lapply(regmatches(xx.1, regexec("(.*)_(.*)", xx.1)),
              tail, -1))

And this one is a little less academic but easy to understand:

unlist(strsplit(sub("(.*)_(.*)", "\\1@@@\\2", xx.1), "@@@"))
flodel

What about just pasting it back together after it's split?

rsplit <- function( x, s ) {
  spl <- strsplit( x, s, fixed=TRUE )[[1]]  # split the argument, not a hard-coded string
  res <- paste( spl[-length(spl)], collapse=s )
  c( res, spl[length(spl)]  )
}
> rsplit("zz_ZZ_uu_d", "_")
[1] "zz_ZZ_uu" "d"  
Ari B. Friedman

I also thought of an approach very similar to Ari's:

> res <- lapply(strsplit(xx.1, "_"), function(x){
    c(paste0(x[-length(x)], collapse="_" ), x[length(x)])
  }) 

> unlist(res)
[1] "zz_ZZ_uu" "d"        "II_OO"    "d"  
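The split-and-paste idea can be extended to return up to n pieces with base functions only. The following is a minimal sketch, not taken from any of the answers: the name rsplit_base is hypothetical, s is treated as a fixed string rather than a regex, and n counts pieces (as in stringr's str_split), not separators.

```r
# Hypothetical helper: split each string from the right into at most n pieces
rsplit_base <- function(x, s, n) {
  unlist(lapply(strsplit(x, s, fixed = TRUE), function(parts) {
    k <- length(parts)
    if (k <= n) return(parts)  # not enough separators: nothing to re-join
    # Re-join the leftmost pieces, keep the last n - 1 pieces as they are
    head_part <- paste(parts[seq_len(k - n + 1)], collapse = s)
    c(head_part, tail(parts, n - 1))
  }))
}

xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")
rsplit_base(xx.1, "_", 2)
# [1] "zz_ZZ_uu" "d"        "II_OO"    "d"
rsplit_base("a_b_c_d_e_f_g", "_", 3)
# [1] "a_b_c_d_e" "f"         "g"
```

One caveat: strsplit() drops a trailing empty piece, so a separator at the very end of a string is not preserved by this sketch.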
Jilber Urbina

This gives exactly what you want and is the simplest approach:

require(stringr)
as.vector(t(str_match(xx.1, '(.*)_(.*)') [,-1]))
[1] "zz_ZZ_uu" "d"        "II_OO"    "d"

Explanation:

  • str_split() is not the droid you're looking for, because it only splits left to right, and splitting and then re-pasting all the (n-1) leftmost pieces is a total waste of time. So use str_match() with a regex with two capture groups. Note that the first (.*) is greedy and matches everything up to the last occurrence of _, which is what you want. (If a string contains no _ at all, the match fails and NAs are returned.)
  • str_match() returns a matrix where the first column is the entire string, and subsequent columns are individual capture groups. We don't want the first column, so drop it with [,-1]
  • as.vector() will unroll that matrix column-wise, which is not what you want, so we use t() to transpose it to unroll row-wise
  • str_match(string, pattern) is vectorized over both string and pattern, which is neat
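The transpose step can be seen with a plain matrix and no stringr at all. In the sketch below, m is a hand-built stand-in, assumed to have the same shape as str_match(xx.1, '(.*)_(.*)')[,-1]: one row per input string, one column per capture group.

```r
# Hand-built stand-in for the capture-group matrix (an assumption, not stringr output)
m <- matrix(c("zz_ZZ_uu", "II_OO", "d", "d"), nrow = 2)

col_wise <- as.vector(m)     # unrolls column by column
row_wise <- as.vector(t(m))  # transpose first, so it unrolls row by row

col_wise  # "zz_ZZ_uu" "II_OO" "d" "d"
row_wise  # "zz_ZZ_uu" "d" "II_OO" "d"
```

Only the row-wise order keeps each head next to its own tail.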
smci
  • By the way, if you do a lot of this, define a custom function `str_rsplit <- function(...) { ... }` – smci Sep 21 '16 at 06:44