Select every nth character from a string

Question

I have a string of random letters with random spaces and some periods as well. I want to take every nth value (e.g. every 10th) from it. My thought was that if I can transpose it then I can use the row numbers to select for every nth value. Any help is appreciated!

string <- "hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx"

Henrik · Accepted Answer · 2021-02-23T21:21:30.797

9

To follow up on OP's idea ("use the row numbers"): split the string into characters, fill a matrix with 10 rows, and select the first row.

matrix(strsplit(string, "")[[1]], nrow = 10)[1, ]
# [1] "h" "d" "r" "." "j" "x"

You will get a recycling warning, but that will not affect us because we select the first row.
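If the warning is a nuisance, one option is to pad the character vector to a multiple of the row count before filling the matrix. A minimal sketch, assuming a hypothetical helper name `every_nth_matrix` that is not part of the answer itself:

```r
string <- "hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx"

# Hypothetical helper: same matrix idea, padded so nothing is recycled.
every_nth_matrix <- function(s, n) {
  chars <- strsplit(s, "")[[1]]
  # Number of NA fillers needed to reach a multiple of n (always < n)
  pad <- (-length(chars)) %% n
  m <- matrix(c(chars, rep(NA_character_, pad)), nrow = n)
  # Row 1 holds positions 1, n + 1, 2n + 1, ...; the padding sits at the
  # bottom of the last column, so it does not reach row 1 for this input
  m[1, ]
}

every_nth_matrix(string, 10)
# [1] "h" "d" "r" "." "j" "x"
```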

Good ol' charToRaw:

rawToChar(charToRaw(string)[c(TRUE, rep(FALSE, 9))])
# [1] "hdr.jx"
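The logical mask is recycled along the raw vector, which is what makes this work for any step. A small sketch that wraps it for an arbitrary step; the helper name `every_nth_raw` is made up for illustration, and because it selects bytes rather than characters it assumes single-byte (ASCII) input:

```r
string <- "hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx"

# Hypothetical helper: keep one byte, drop the next n - 1, repeat.
# Assumption: every character is a single byte (plain ASCII).
every_nth_raw <- function(s, n) {
  keep <- c(TRUE, rep(FALSE, n - 1))  # recycled over the whole raw vector
  rawToChar(charToRaw(s)[keep])
}

every_nth_raw(string, 10)
# [1] "hdr.jx"
```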

edited Feb 23 '21 at 21:21

answered Feb 23 '21 at 21:08

Henrik


akrun · Answer 2 · 2021-02-23T21:10:51.753

8

We can split the string and use seq to get the elements

v1 <- strsplit(string, "")[[1]]
v1[seq(1, length(v1), by = 10)]
#[1] "h" "d" "r" "." "j" "x"
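"Every 10th" could also mean positions 10, 20, 30, ... rather than 1, 11, 21, .... A sketch that makes the starting position explicit and also copes with strings shorter than the start; `every_nth_from` is a hypothetical name, not something from the answer:

```r
string <- "hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx"

# Hypothetical helper: keep positions start, start + n, start + 2n, ...
every_nth_from <- function(s, n, start = 1) {
  chars <- strsplit(s, "")[[1]]
  idx <- seq_along(chars)
  # A logical filter avoids the error seq() raises when start > length
  chars[idx >= start & (idx - start) %% n == 0]
}

every_nth_from(string, 10)              # positions 1, 11, 21, ...
# [1] "h" "d" "r" "." "j" "x"
every_nth_from(string, 10, start = 10)  # positions 10, 20, 30, ...
# [1] "l" "i" "q" "f" "q"
```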

Or with a regex capture group and backreference (keep the first character of each chunk of up to 10, drop the rest)

library(stringr)
str_replace_all(string, "(.).{1,9}", "\\1")
#[1] "hdr.jx"

Or make it dynamic with glue (here n is the number of characters to skip after each kept one, i.e. the step minus 1)

n <- 9
str_replace_all(string, glue::glue("(.).{1,[n]}",
          .open = '[', .close = ']'), "\\1")
#[1] "hdr.jx"
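The same pattern can probably be built without extra packages. A base R sketch, swapping in `sprintf` and `gsub` for `glue` and `str_replace_all`; the variable names `step` and `pattern` are assumptions for illustration, and `step` needs to be at least 2 for the `{1,k}` quantifier to be valid:

```r
string <- "hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx"

step <- 10  # keep one character out of every `step`
# Capture one character, then consume up to step - 1 more
pattern <- sprintf("(.).{1,%d}", step - 1)
result <- gsub(pattern, "\\1", string)
result
# [1] "hdr.jx"
```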

edited Feb 23 '21 at 21:10

answered Feb 23 '21 at 21:00

akrun


score 5 · Answer 3 · answered Feb 23 '21 at 21:00

substring will take a vector of first= and last=, so we can form an appropriate sequence and go from there.

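func <- function(x, n, start = 1) {
  # Process each string z in x separately so a vector of strings works
  vapply(x, function(z) {
    # Positions to keep: start, start + n, ... up to the end of z
    i <- seq.int(start, nchar(z), by = n)
    i <- i[i > 0]
    # Index into z (the current string), not the whole vector x
    paste(substring(z, i, i), collapse = "")
  }, character(1))
}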

func(string, 10)
# hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx 
#                                                    "hdr.jx"

where every 10th character (starting at position 1) is

hutmnycdsldzlkt.ytairuaypk  dq.gubgp hyfjuwvpcdmvqxfcuhapnx 
12345678901234567890123456789012345678901234567890123456789
^         ^         ^         ^         ^         ^
h         d         r         .         j         x

(The biggest reason I went with an apply variant is in case you have a vector of strings, where substring on its own will not work as elegantly.)
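To illustrate the vector-of-strings case, here is a self-contained sketch along the same lines. The name `every_nth_vec` is hypothetical, and the early return for strings shorter than `start` is an added assumption to sidestep the error `seq.int` raises when `start` exceeds the string length:

```r
# Hypothetical variant of the per-string approach for a character vector.
every_nth_vec <- function(x, n, start = 1) {
  vapply(x, function(z) {
    # Nothing to select if the string ends before the first position
    if (nchar(z) < start) return("")
    i <- seq.int(start, nchar(z), by = n)
    paste(substring(z, i, i), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

every_nth_vec(c("abcdefghijklmnopqrstuvwxyz", "0123456789", ""), 10)
# [1] "aku" "0"   ""
```

Each element is handled on its own, so strings of different lengths (including empty ones) each get their own result.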

answered Feb 23 '21 at 21:00

r2evans


I think you can avoid the `vapply` and make this much faster by `rep`eating each `x` value many times along with the maximum `i` sequence and then only calling `substring` once. Something like: `func2 <- function(x, n, start = 1) { mnc <- max(nchar(x)); i <- seq.int(start, mnc, by = n); paste(substring(rep(x, each=length(i)), i, i), collapse="") }` – thelatemail Feb 23 '21 at 21:22
Yeah, I had thought about that. My initial thought (coded here) intentionally tried to not `substring` beyond a string's length, but in hindsight indexing beyond is silently 0-length, so an unnecessary precaution. I think your method is certainly simpler and likely faster. Thanks, @thelatemail. – r2evans Feb 23 '21 at 21:57
Although I may have spoken too soon - the result from my edit would still have to be broken out somehow to separate vector elements, so it's not quite right. – thelatemail Feb 23 '21 at 22:07

score 1 · Answer 4 · answered Feb 23 '21 at 23:12

A base R option using substring + seq + nchar

v <- seq(1, nchar(string), by = 10)
substring(string, v, v)

gives

"h" "d" "r" "." "j" "x"

answered Feb 23 '21 at 23:12

ThomasIsCoding


score 1 · Answer 5 · answered Feb 23 '21 at 23:41

Okay, here's an addition to @r2evans's answer that tries to speed things up by calling the vectorised substring once, rather than looping over each individual value.

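func2 <- function(x, n, start = 1) {
    mnc <- max(nchar(x))
    # Positions to keep, up to the length of the longest string
    i <- seq.int(start, mnc, by = n)
    # One vectorised substring call; positions past a string's end give ""
    res <- paste(substring(rep(x, each=length(i)), i, i), collapse="")
    # Number of kept characters contributed by each string
    fi <- findInterval(nchar(x), i)
    # Cut the combined string back into one piece per input string
    substring(res, c(1, head(cumsum(fi),-1) + 1), cumsum(fi) )
}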
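The `findInterval` step is what lets the combined string be cut back apart: for each input it counts how many of the selected positions fall within that string's length. A quick sanity check on inputs of unequal length, comparing against a straightforward per-string loop; the name `reference` and the test vector are made up for illustration:

```r
func2 <- function(x, n, start = 1) {
  mnc <- max(nchar(x))
  i <- seq.int(start, mnc, by = n)
  res <- paste(substring(rep(x, each = length(i)), i, i), collapse = "")
  fi <- findInterval(nchar(x), i)
  substring(res, c(1, head(cumsum(fi), -1) + 1), cumsum(fi))
}

# Hypothetical reference: select the same positions one string at a time
reference <- function(x, n, start = 1) {
  i <- seq.int(start, max(nchar(x)), by = n)
  vapply(x, function(z) paste(substring(z, i, i), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

x <- c("abc", "abcdefghijklmno", "")
func2(x, 10)
# [1] "a"  "ak" ""
identical(func2(x, 10), reference(x, 10))
# [1] TRUE
```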

Quick test on 20K records:

x <- c("12345678901234567890", "09876543210987654321")
bigx <- rep(x,1e4)

system.time(func(bigx, 10, 1))
##   user  system elapsed 
##  38.29    0.03   38.36 

system.time(func2(bigx, 10, 1))
##   user  system elapsed 
##   0.02    0.00    0.02
