How to remove certain items from a vector?

Question

Example vector (gene transcript ids):

a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

This is a subset of a long vector. How can I remove the items that begin with 'MS', then cut off the last two characters of the remaining items?

Related post: [Remove part of string after “.”](https://stackoverflow.com/questions/10617702) — zx8754, Mar 15 '19 at 11:25

Ronak Shah · Answer 1 · 2019-03-15T02:47:54.237

If we want to avoid regex completely, as @sindri_baldur mentions, we can use startsWith() and substr():

string <- a[!startsWith(a, "MS")]
substr(string, 1, nchar(string) - 2)

Or with grep() and substr():

string <- grep('^MS',a, invert = TRUE, value = TRUE)
substr(string, 1, nchar(string) - 2)
#[1] "AT2G26340" "AT2G26355"
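Both variants above trim a fixed number of characters, so they rely on every suffix being exactly a dot plus one digit. As a minimal sketch of that assumption, the ids below are hypothetical (the second one has a two-digit version, which does not appear in the question), and the `sub()` pattern is just one possible alternative that removes everything from the last dot onward:

```r
# Hypothetical ids: the second one carries a two-digit version suffix.
ids <- c("AT2G26340.2", "AT2G26340.10")

# Fixed-width trimming always removes exactly two characters.
fixed_width <- substr(ids, 1, nchar(ids) - 2)

# Pattern-based trimming removes the last dot and whatever follows it.
by_last_dot <- sub("\\.[^.]*$", "", ids)

print(fixed_width)
# [1] "AT2G26340"  "AT2G26340."
print(by_last_dot)
# [1] "AT2G26340" "AT2G26340"
```

If all ids in the real data end in a dot and a single digit, as in the question, the fixed-width version is sufficient.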

Since we have quite a few new answers, here is a benchmark that includes all of them, run on a vector of length 400k.

a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
a <- rep(a, 100000)

library(microbenchmark)
library(stringr)  # needed for word(), str_detect() and fixed() in the sotos entry

microbenchmark(
ronak1 = {string <- a[!startsWith(a, "MS")];substr(string, 1, nchar(string) - 2)}, 
ronak2 = {string <- grep('^MS',a, invert = TRUE, value = TRUE);substr(string, 1, nchar(string) - 2)}, 
sotos = {word(a[!str_detect(a, '^MS')], 1, sep = fixed('.'))}, 
thothal = {b1 <- a[!grepl("^MS", a)];gsub("\\.[0-9]$", "", b1)}, 
zx8754 = tools::file_path_sans_ext(a[ !grepl("^MS", a) ]), 
tmfmnk = dirname(chartr(".", "/", a[!grepl("^MS", a)])), 
NelSonGon = {b<-stringi::stri_replace_all(stringi::stri_sub(a,1,-3),regex="^M.*","");b[grepl('\\w+',b)]}
)


#Unit: milliseconds
#      expr        min         lq       mean     median         uq       max neval
#    ronak1   34.75928   38.58217   45.63393   40.32845   44.24355  225.2581   100
#    ronak2   94.10687   96.72758  110.83819   99.26914  105.98822  938.2969   100
#     sotos 1926.21112 2500.27209 2852.43240 2861.61699 3173.10420 4478.7890   100
#   thothal  155.95328  160.62800  169.02275  164.46494  169.32770  218.5033   100
#    zx8754  172.96970  179.03618  186.12374  183.96887  188.06251  234.1895   100
#    tmfmnk  189.29085  195.14593  208.89245  199.47172  204.40604  547.7497   100
# NelSonGon  186.54426  198.29856  226.19221  206.54542  217.92970  948.2535   100
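Timings are only comparable if the candidates return the same result. A small sanity check along these lines could be run before benchmarking. It is a sketch covering only the entries that need nothing beyond base R and the bundled tools package; the stringr and stringi entries are left out to keep it self-contained, and the names `results` and `all_agree` are introduced here for illustration:

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

# Evaluate each candidate once on the small example vector.
results <- list(
  ronak1  = { s <- a[!startsWith(a, "MS")]; substr(s, 1, nchar(s) - 2) },
  ronak2  = { s <- grep("^MS", a, invert = TRUE, value = TRUE)
              substr(s, 1, nchar(s) - 2) },
  thothal = gsub("\\.[0-9]$", "", a[!grepl("^MS", a)]),
  zx8754  = tools::file_path_sans_ext(a[!grepl("^MS", a)]),
  tmfmnk  = dirname(chartr(".", "/", a[!grepl("^MS", a)]))
)

# Compare every result against the first one.
all_agree <- all(vapply(results, identical, logical(1), results[[1]]))
print(all_agree)
# [1] TRUE
```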

Your second step avoids regex; you can avoid regex completely with something like `a[!startsWith(a, "MS")]` — s_baldur, Mar 14 '19 at 08:22
@sindri_baldur right. It's good to have a non-regex option as well. — Ronak Shah, Mar 14 '19 at 08:33
I'd move your second answer with `startsWith` to the top, as it is twice as fast on a 100K vector. — zx8754, Mar 14 '19 at 09:02
@zx8754 done. That's pretty interesting to know. Do you think using regex makes it slower? — Ronak Shah, Mar 14 '19 at 09:10
No idea, help page says it is faster than *substring* and *grepl*, so decided to test, and it is fast. But eats twice more memory, see benchmarks below. — zx8754, Mar 14 '19 at 09:13

Sotos · Answer 2 · 2019-03-14T08:03:39.743

Here is a stringr one-liner as well:

library(stringr)

word(a[!str_detect(a, '^MS')], 1, sep = fixed('.'))
#[1] "AT2G26340" "AT2G26355"

answered Mar 14 '19 at 07:42

score 3 · Answer 3 · answered Mar 14 '19 at 07:36

Code

a <- a[!grepl("^MS", a)]
gsub("\\.[0-9]$", "", a)
# [1] "AT2G26340" "AT2G26355"

Explanation

1. Use a regex to filter out all elements that start with MS.
2. Use a regex again to remove the trailing dot and digit from the remaining elements.
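To make the two steps concrete, here is a sketch that exposes the intermediate values. The variable names are introduced for illustration, and the last lines use a hypothetical two-digit suffix to show where the single-digit pattern stops matching:

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

# Step 1: flag the elements that start with "MS", then keep the others.
starts_ms <- grepl("^MS", a)
print(starts_ms)
# [1]  TRUE  TRUE FALSE FALSE
kept <- a[!starts_ms]

# Step 2: strip a trailing dot followed by exactly one digit.
trimmed <- gsub("\\.[0-9]$", "", kept)
print(trimmed)
# [1] "AT2G26340" "AT2G26355"

# Hypothetical id with a two-digit suffix: the single-digit pattern
# leaves it untouched, while "[0-9]+" removes the whole suffix.
one_digit  <- gsub("\\.[0-9]$", "", "AT2G26340.10")
any_digits <- gsub("\\.[0-9]+$", "", "AT2G26340.10")
print(c(one_digit, any_digits))
# [1] "AT2G26340.10" "AT2G26340"
```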

thothal

score 2 · Answer 4 · answered Mar 14 '19 at 09:05

As there are about 200K transcripts in human, here is a benchmark on a vector of length 100K:

a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')
a <- rep(a, 25000)

library(stringr)

bench::mark(
  x1 = {
    string <- grep('^MS',a, invert = TRUE, value = TRUE)
    substr(string, 1, nchar(string) - 2) },
  x2 = {
    string <- a[!startsWith(a, "MS")]
    substr(string, 1, nchar(string) - 2)},
  x3 = {
    word(a[!str_detect(a, '^MS')], 1, sep = fixed('.'))  
  },
  x4 = {
    gsub("\\.[0-9]$", "", a[ !grepl("^MS", a) ])},
  x5 = {
    tools::file_path_sans_ext(a[ !grepl("^MS", a) ])  
  }
  )

# A tibble: 5 x 14
# expression      min     mean  median     max `itr/sec` mem_alloc  n_gc n_itr total_time result memory  time  gc   
# <chr>      <bch:tm> <bch:tm> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list>  <lis> <lis>
# x1           20.3ms   21.3ms    21ms  28.1ms     46.9     1.91MB     1    24      512ms <chr ~ <Rprof~ <bch~ <tib~
# x2           11.7ms   12.6ms  12.3ms  17.8ms     79.3     2.86MB     3    40      505ms <chr ~ <Rprof~ <bch~ <tib~
# x3          668.5ms  668.5ms 668.5ms 668.5ms      1.50   10.54MB     9     1      668ms <chr ~ <Rprof~ <bch~ <tib~
# x4           23.8ms   24.6ms  24.1ms  32.2ms     40.7      2.1MB     1    21      516ms <chr ~ <Rprof~ <bch~ <tib~
# x5           33.8ms   35.2ms  34.7ms  40.9ms     28.4      2.1MB     1    15      528ms <chr ~ <Rprof~ <bch~ <tib~

score 1 · Answer 5 · answered Mar 14 '19 at 08:39

Think of them as filenames and drop the extension:

tools::file_path_sans_ext(a[ !grepl("^MS", a) ])
# [1] "AT2G26340" "AT2G26355"

zx8754

score 1 · Answer 6 · answered Mar 15 '19 at 03:02

I don't see a combination of sub() and startsWith(), so here is one:

sub(".{2}$", "", a[!startsWith(a, "MS")])
# [1] "AT2G26340" "AT2G26355"

Rich Scriven

tmfmnk · Answer 7 · 2019-03-14T09:48:41.320

You can also try:

dirname(chartr(".", "/", a[!grepl("^MS", a)]))
# [1] "AT2G26340" "AT2G26355"

First, grepl() identifies the elements that start with MS, and the negated index drops them. Second, chartr() replaces each . with /. Finally, dirname() returns the part of each string up to the last /.
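The same pipeline can be unrolled to show what each call returns. This is only an illustrative sketch; the intermediate variable names are not part of the original answer:

```r
a <- c('MSTRG.7176.1', 'MSTRG.7176.2', 'AT2G26340.2', 'AT2G26355.1')

# Drop the elements that start with "MS".
kept <- a[!grepl("^MS", a)]
print(kept)
# [1] "AT2G26340.2" "AT2G26355.1"

# Turn each dot into a path separator so the id looks like a file path.
as_path <- chartr(".", "/", kept)
print(as_path)
# [1] "AT2G26340/2" "AT2G26355/1"

# dirname() keeps everything before the last separator.
result <- dirname(as_path)
print(result)
# [1] "AT2G26340" "AT2G26355"
```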

Considering that there may be elements that do not start with MS but contain two or more dots, you can use:

chartr("/", ".", dirname(chartr(".", "/", a[!grepl("^MS", a)])))

It's the same as the first possibility, but it converts the remaining `/` back to `.`.

Or the same approach with gsub() in place of chartr():

gsub("/", ".", dirname(gsub(".", "/", a[!grepl("^MS", a)], fixed = TRUE)), 
     fixed = TRUE)
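To see why the conversion back matters, consider an id with two dots. The id below is hypothetical and does not appear in the question; the sketch only compares the first variant with the round-trip variant:

```r
# Hypothetical id with two dots.
x <- "AT2G26340.2.1"

# First variant: the remaining dot stays converted to "/".
one_way <- dirname(chartr(".", "/", x))
print(one_way)
# [1] "AT2G26340/2"

# Round-trip variant: the remaining "/" is turned back into ".".
round_trip <- chartr("/", ".", dirname(chartr(".", "/", x)))
print(round_trip)
# [1] "AT2G26340.2"
```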

NelsonGon · Answer 8 · 2019-03-14T12:43:38.920

Providing a stringi possibility: I prefer one-liners, but a two-line solution might suffice.

b <- stringi::stri_replace_all(stringi::stri_sub(a, 1, -3), regex = "^M.*", "")
b[grepl('\\w+', b)]
#[1] "AT2G26340" "AT2G26355"

answered Mar 14 '19 at 12:36
