6

I have the following sequence:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

What I want to do is detect the ranges of positions occupied by non-dash characters.

----?????-----?V?D????-------???IL??A?---
|   |   |     |      |       |       |  
1   5   9    15     22      30      38

The final output will be a vector of strings:

out <- c("5-9", "15-22", "30-38")

How can I achieve that with R?

littleworth

6 Answers

10

Here is another possible solution, using the stringr package:

Reprex

  • Code
library(stringr)

s <- as.data.frame(str_locate_all(my_seq, "[^-]+")[[1]])
result <- paste(s$start, s$end, sep ="-")
  • Output
result
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)

lovalery
6

You could do:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

non_dash <- which(strsplit(my_seq, "")[[1]] != '-')
pos      <- non_dash[c(0, diff(non_dash)) != 1 | c(diff(non_dash), 0) != 1]

apply(matrix(pos, ncol = 2, byrow = TRUE), 1, function(x) paste(x, collapse = "-"))
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)
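
The indexing above assumes every run is at least two characters long: a one-character run contributes only a single position, so the two-column matrix no longer lines up. A minimal sketch that tracks starts and ends separately avoids this. Here `run_ranges` is a hypothetical helper name, not part of the answer:

```r
# Hypothetical helper: ranges of consecutive non-dash positions in one string.
run_ranges <- function(x) {
  non_dash <- which(strsplit(x, "")[[1]] != "-")
  if (length(non_dash) == 0) return(character(0))
  # A run starts where the previous non-dash position is not adjacent,
  # and ends where the next non-dash position is not adjacent.
  starts <- non_dash[c(TRUE, diff(non_dash) != 1)]
  ends   <- non_dash[c(diff(non_dash) != 1, TRUE)]
  paste(starts, ends, sep = "-")
}

run_ranges("----?????-----?V?D????-------???IL??A?---")
#> [1] "5-9"   "15-22" "30-38"
run_ranges("--A--BC-")
#> [1] "3-3" "6-7"
```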

Allan Cameron
5

Inspired by @lovalery's great answer, here is a base R solution:

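# Find every maximal run of non-dash characters
g <- gregexpr(pattern = "[^-]+", my_seq)
# Start positions come from the match itself; ends from start + length - 1
d <- data.frame(start = unlist(g),
                end   = unlist(g) + attr(g[[1]], "match.length") - 1)
paste(d$start, d$end, sep = "-")
#> [1] "5-9"   "15-22" "30-38"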
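
If several sequences need to be processed at once, or some may contain no non-dash characters at all, the gregexpr approach could be wrapped as in the sketch below. `non_dash_ranges` is a hypothetical name, and the sketch assumes the input contains no NA values:

```r
# Hypothetical helper: one character vector of ranges per input string.
non_dash_ranges <- function(x) {
  lapply(gregexpr("[^-]+", x), function(m) {
    # gregexpr() signals "no match" with a single -1
    if (m[1] == -1) return(character(0))
    starts <- as.integer(m)
    ends   <- starts + attr(m, "match.length") - 1L
    paste(starts, ends, sep = "-")
  })
}

my_seq <- "----?????-----?V?D????-------???IL??A?---"
res <- non_dash_ranges(c(my_seq, "-----", "AB--C"))
res[[1]]
#> [1] "5-9"   "15-22" "30-38"
res[[3]]
#> [1] "1-2" "5-5"
```

The second element, `res[[2]]`, is an empty character vector because "-----" has no non-dash run.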
Maël
5

A one-liner in base R with utf8ToInt (45 is the code point of "-"):

apply(matrix(which(diff(c(FALSE, utf8ToInt(my_seq) != 45L, FALSE)) != 0) - 0:1, 2), 2, paste, collapse = "-")
#> [1] "5-9"   "15-22" "30-38"
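
For readers who find the one-liner dense, the same idea can be unpacked step by step. This is only an illustrative expansion; the intermediate names (`is_run`, `changes`, `starts`, `ends`) are introduced here and are not part of the answer:

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# TRUE wherever the character is not "-" (code point 45).
is_run <- utf8ToInt(my_seq) != 45L

# Pad with FALSE on both sides so runs touching either end of the string
# still produce a change; diff() is then +1 where a run starts and -1
# at the position just after a run ends.
changes <- diff(c(FALSE, is_run, FALSE))

starts <- which(changes == 1)
ends   <- which(changes == -1) - 1  # step back onto the last run character

paste(starts, ends, sep = "-")
#> [1] "5-9"   "15-22" "30-38"
```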
jblood94
4

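Try the following. Note that gregexec() requires R >= 4.1.0, and that this assumes every run starts and ends with "?", as in the example: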

paste0(gregexec('-\\?', my_seq)[[1]][1,] + 1, '-',
       gregexec('\\?-', my_seq)[[1]][1,])
#> [1] "5-9"   "15-22" "30-38"
jassis
3

Here is an rle + tidyverse approach:

library(dplyr)
with(rle(strsplit(my_seq, "")[[1]] != "-"),
     data.frame(lengths, values)) |>
  mutate(end = cumsum(lengths)) |>
  mutate(start =  1 + lag(end, 1,0)) |>
  mutate(rng = paste(start, end, sep = "-")) |>
  filter(values) |>
  pull(rng)

#> [1] "5-9"   "15-22" "30-38"

However, if you don't mind installing S4Vectors (a Bioconductor package), the code can be made really terse:

library(S4Vectors)

r <- Rle(strsplit(my_seq, "")[[1]] != "-")

paste(start(r), end(r), sep = "-")[runValue(r)]

#> [1] "5-9"   "15-22" "30-38"
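
The same run-length idea also works with base R's rle() alone, without dplyr or S4Vectors. A minimal sketch, with `starts` and `ends` as illustrative names:

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# Run-length encode the "is not a dash" indicator.
r <- rle(strsplit(my_seq, "")[[1]] != "-")

# Each run ends at the cumulative length and starts lengths - 1 before that.
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep only the runs whose value is TRUE (non-dash runs).
paste(starts, ends, sep = "-")[r$values]
#> [1] "5-9"   "15-22" "30-38"
```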
Stefano Barbi