6

Example vector (gene transcript ids):

a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

This is a subset of a long vector. How can I remove the items that begin with 'MS', then cut off the last 2 characters of the remaining items?

zx8754
Donghui HU
  • Related post: [Remove part of string after “.”](https://stackoverflow.com/questions/10617702) – zx8754 Mar 15 '19 at 11:25

8 Answers

7

If we want to avoid regex completely, as @sindri_baldur mentions, we can use

string <- a[!startsWith(a, "MS")]
substr(string, 1, nchar(string) - 2)

Or with grep and substr

string <- grep('^MS',a, invert = TRUE, value = TRUE)
substr(string, 1, nchar(string) - 2)
#[1] "AT2G26340" "AT2G26355"
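
Both variants trim a fixed two characters, so they rely on every remaining id ending in a dot plus a single character. A minimal regex-free sketch of how one might check that assumption first is shown here; the names `pos`, `all_one_char_suffix`, and the id `"AT2G26340.10"` are made up for illustration and are not part of the original data.

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
string <- a[!startsWith(a, "MS")]

# Position of the second-to-last character of each id
pos <- nchar(string) - 1

# TRUE only if the second-to-last character of every id is a dot,
# i.e. each id ends in "." followed by one character
all_one_char_suffix <- all(substr(string, pos, pos) == ".")
all_one_char_suffix
#[1] TRUE

# A hypothetical id with a two-digit suffix fails the same check,
# so trimming two characters would leave "AT2G26340." behind
b <- c(string, "AT2G26340.10")
pos_b <- nchar(b) - 1
b_ok <- all(substr(b, pos_b, pos_b) == ".")
b_ok
#[1] FALSE
```

If the check fails, one of the approaches that strips everything after the dot may be a safer choice than fixed-width trimming.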

Since quite a few new answers have been added, here is a benchmark covering them, run on a vector of length 400k.

a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
a <- rep(a, 100000)

library(microbenchmark)
library(stringr)  # needed for word(), str_detect() and fixed() in the sotos entry

microbenchmark(
ronak1 = {string <- a[!startsWith(a, "MS")];substr(string, 1, nchar(string) - 2)}, 
ronak2 = {string <- grep('^MS',a, invert = TRUE, value = TRUE);substr(string, 1, nchar(string) - 2)}, 
sotos = {word(a[!str_detect(a, '^MS')], 1, sep = fixed('.'))}, 
thothal = {b1 <- a[!grepl("^MS", a)];gsub("\\.[0-9]$", "", b1)}, 
zx8754 = tools::file_path_sans_ext(a[ !grepl("^MS", a) ]), 
tmfmnk = dirname(chartr(".", "/", a[!grepl("^MS", a)])), 
NelSonGon = {b<-stringi::stri_replace_all(stringi::stri_sub(a,1,-3),regex="^M.*","");b[grepl('\\w+',b)]}
)


#Unit: milliseconds
#      expr        min         lq       mean     median         uq       max neval
#    ronak1   34.75928   38.58217   45.63393   40.32845   44.24355  225.2581   100
#    ronak2   94.10687   96.72758  110.83819   99.26914  105.98822  938.2969   100
#     sotos 1926.21112 2500.27209 2852.43240 2861.61699 3173.10420 4478.7890   100
#   thothal  155.95328  160.62800  169.02275  164.46494  169.32770  218.5033   100
#    zx8754  172.96970  179.03618  186.12374  183.96887  188.06251  234.1895   100
#    tmfmnk  189.29085  195.14593  208.89245  199.47172  204.40604  547.7497   100
# NelSonGon  186.54426  198.29856  226.19221  206.54542  217.92970  948.2535   100
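
Timings are only comparable if the candidates return the same thing. A small sketch of how that could be checked for the base R variants is given here, on a shorter vector; the list name `results` and its element names are chosen for illustration only, and the stringr and stringi variants are left out so the snippet needs nothing beyond a standard R installation.

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
a <- rep(a, 1000)

# Ids that do not start with "MS", reused by the fixed-width variants
keep_sw <- a[!startsWith(a, "MS")]
keep_grep <- grep('^MS', a, invert = TRUE, value = TRUE)

results <- list(
  startsWith_substr  = substr(keep_sw, 1, nchar(keep_sw) - 2),
  grep_substr        = substr(keep_grep, 1, nchar(keep_grep) - 2),
  grepl_gsub         = gsub("\\.[0-9]$", "", a[!grepl("^MS", a)]),
  file_path_sans_ext = tools::file_path_sans_ext(a[!grepl("^MS", a)]),
  chartr_dirname     = dirname(chartr(".", "/", a[!grepl("^MS", a)]))
)

# Compare every result against the first one
all_same <- all(vapply(results[-1], identical, FUN.VALUE = logical(1),
                       y = results[[1]]))
all_same
#[1] TRUE
```

On this example data the variants agree; on ids with longer suffixes or extra dots they need not, so the check is worth rerunning on real input.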
Ronak Shah
  • Your second step avoids regex, you can avoid regex completely with something like `a[!startsWith(a, "MS")]` – s_baldur Mar 14 '19 at 08:22
  • @sindri_baldur right. It's good to have a non-regex option as well. – Ronak Shah Mar 14 '19 at 08:33
  • I'd move your second answer with `startsWith` to the top, as it is twice faster on 100K vector. – zx8754 Mar 14 '19 at 09:02
  • @zx8754 done. That's pretty interesting to know. Do you think using regex makes it slower? – Ronak Shah Mar 14 '19 at 09:10
  • No idea, help page says it is faster than *substring* and *grepl*, so decided to test, and it is fast. But eats twice more memory, see benchmarks below. – zx8754 Mar 14 '19 at 09:13
6

Here is a stringr one-liner as well,

library(stringr)

word(a[!str_detect(a, '^MS')], 1, sep = fixed('.'))
#[1] "AT2G26340" "AT2G26355"
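
For comparison, a rough base R counterpart of the same idea (keep the piece before the first dot) might look like the sketch below. It uses `strsplit()` with `fixed = TRUE`; the variable names `kept` and `first_piece` are illustrative.

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

# Drop ids starting with "MS" without regex
kept <- a[!startsWith(a, "MS")]

# Split on a literal dot and keep the first piece of each id
first_piece <- vapply(strsplit(kept, ".", fixed = TRUE),
                      function(p) p[1], character(1))
first_piece
#[1] "AT2G26340" "AT2G26355"
```

Note that taking the first piece is not the same as dropping the last two characters: an id with more than one dot would be cut at the first dot.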
Sotos
3

Code

a <- a[!grepl("^MS", a)]
gsub("\\.[0-9]$", "", a)
# [1] "AT2G26340" "AT2G26355"

Explanation

  1. Use regex to filter out all elements which start with MS
  2. Use regex again to replace the dot and the last digit from the remaining elements
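
The two steps can be traced with their intermediate values, as in this sketch. The id `"AT2G26340.10"` and the `+` variant of the pattern are additions for illustration, assuming suffixes with more than one digit could occur in the full vector.

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

# Step 1: which elements start with "MS"?
is_ms <- grepl("^MS", a)
is_ms
#[1]  TRUE  TRUE FALSE FALSE

kept <- a[!is_ms]

# Step 2: remove a dot followed by one digit at the end of the string
trimmed <- sub("\\.[0-9]$", "", kept)
trimmed
#[1] "AT2G26340" "AT2G26355"

# The pattern matches exactly one digit, so a two-digit suffix is left alone
one_digit <- sub("\\.[0-9]$", "", "AT2G26340.10")
one_digit
#[1] "AT2G26340.10"

# Allowing one or more digits handles that case as well
many_digits <- sub("\\.[0-9]+$", "", "AT2G26340.10")
many_digits
#[1] "AT2G26340"
```

Because the pattern is anchored with `$`, it can match at most once per string, so `sub()` and `gsub()` give the same result here.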
thothal
2

As there are about 200K transcripts in human, here is a benchmark on a vector of length 100K:

a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
a <- rep(a, 25000)

library(stringr)

bench::mark(
  x1 = {
    string <- grep('^MS',a, invert = TRUE, value = TRUE)
    substr(string, 1, nchar(string) - 2) },
  x2 = {
    string <- a[!startsWith(a, "MS")]
    substr(string, 1, nchar(string) - 2)},
  x3 = {
    word(a[!str_detect(a, '^MS')], 1, sep = fixed('.'))  
  },
  x4 = {
    gsub("\\.[0-9]$", "", a[ !grepl("^MS", a) ])},
  x5 = {
    tools::file_path_sans_ext(a[ !grepl("^MS", a) ])  
  }
  )

# A tibble: 5 x 14
# expression      min     mean  median     max `itr/sec` mem_alloc  n_gc n_itr total_time result memory  time  gc   
# <chr>      <bch:tm> <bch:tm> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list>  <lis> <lis>
# x1           20.3ms   21.3ms    21ms  28.1ms     46.9     1.91MB     1    24      512ms <chr ~ <Rprof~ <bch~ <tib~
# x2           11.7ms   12.6ms  12.3ms  17.8ms     79.3     2.86MB     3    40      505ms <chr ~ <Rprof~ <bch~ <tib~
# x3          668.5ms  668.5ms 668.5ms 668.5ms      1.50   10.54MB     9     1      668ms <chr ~ <Rprof~ <bch~ <tib~
# x4           23.8ms   24.6ms  24.1ms  32.2ms     40.7      2.1MB     1    21      516ms <chr ~ <Rprof~ <bch~ <tib~
# x5           33.8ms   35.2ms  34.7ms  40.9ms     28.4      2.1MB     1    15      528ms <chr ~ <Rprof~ <bch~ <tib~
zx8754
1

Think of them as filenames and drop the extension:

tools::file_path_sans_ext(a[ !grepl("^MS", a) ])
# [1] "AT2G26340" "AT2G26355"
zx8754
1

I don't see a combination of sub() and startsWith(), so

sub(".{2}$", "", a[!startsWith(a, "MS")])
# [1] "AT2G26340" "AT2G26355"
Rich Scriven
0

You can also try:

dirname(chartr(".", "/", a[!grepl("^MS", a)]))

# [1] "AT2G26340" "AT2G26355"

First, grepl() identifies the elements that start with MS, and these are dropped. Second, chartr() replaces each . with /. Finally, dirname() returns the part of each string up to the last /.

Considering that there may be elements that do not start with MS but contain two or more dots, you can use:

chartr("/", ".", dirname(chartr(".", "/", a[!grepl("^MS", a)])))

It's the same as the first possibility, but it converts the remaining / back to `.`.

Or the second possibility, with gsub() in place of chartr():

gsub("/", ".", dirname(gsub(".", "/", a[!grepl("^MS", a)], fixed = TRUE)), 
     fixed = TRUE)
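
To make the chain easier to follow, here is a sketch that prints each intermediate value. The id `"AT1G01010.1.2"` is a hypothetical example with two dots, added only to show why the conversion back to `.` matters.

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
kept <- a[!grepl("^MS", a)]

# Dots become slashes, so each id looks like a file path
slashed <- chartr(".", "/", kept)
slashed
#[1] "AT2G26340/2" "AT2G26355/1"

# dirname() drops everything from the last slash onwards
parent <- dirname(slashed)
parent
#[1] "AT2G26340" "AT2G26355"

# With two dots, a slash survives dirname() ...
x <- "AT1G01010.1.2"  # hypothetical id, not from the question
partial <- dirname(chartr(".", "/", x))
partial
#[1] "AT1G01010/1"

# ... and the outer chartr() turns it back into a dot
restored <- chartr("/", ".", partial)
restored
#[1] "AT1G01010.1"
```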
tmfmnk
0

Here is a stringi possibility. I prefer one-liners, but a two-line solution might suffice.

b <- stringi::stri_replace_all(stringi::stri_sub(a, 1, -3), regex = "^M.*", "")
b[grepl('\\w+', b)]
#[1] "AT2G26340" "AT2G26355"
NelsonGon