110

I am working with NCBI Reference Sequence accession numbers like those in variable a:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")  

To get information from the biomaRt package I need to remove the version suffix (.1, .2, etc.) from the accession numbers. I normally do this with this code:

b <- sub("..*", "", a)

# [1] "" "" "" "" "" ""

But as you can see, this isn't the correct way for this variable. Can anyone help me with this?

benson23
Lisann

6 Answers

158

You just need to escape the period:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")

gsub("\\..*","",a)
[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155" 
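To see why the escape matters, here is a minimal sketch comparing the unescaped and escaped patterns on a shortened copy of the vector. The anchored variant at the end is an assumption about the ID format (a version made only of digits at the end of the string), not something the question states.

```r
a <- c("NM_020506.1", "NM_001030297.2")

# Unescaped: "." matches any character, so "..*" matches the whole string
# and everything is replaced with "".
unescaped <- sub("..*", "", a)

# Escaped: "\\." matches a literal period, so only the period and what
# follows it are removed.
escaped <- sub("\\..*", "", a)

# Assumed stricter variant: only strip a trailing ".<digits>" suffix.
anchored <- sub("\\.[0-9]+$", "", a)

print(unescaped)
print(escaped)
print(anchored)
```

Since each ID contains at most one period here, `sub` and `gsub` give the same result.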
Hansi
  • Clarification: With functions in the base package (i.e. without other packages like `stringr`), the options are as posted: `b1 <- gsub("\\..*","",a, fixed=FALSE)` and `b2 <- sub("\\..*","",a, fixed=FALSE)`. In certain cases, you may need to change the `fixed` argument. However, here you *must* have it set to `FALSE` (which is the default); otherwise it won't work. Furthermore, you need the double escape `\\`, or you get an error. – David C. Nov 22 '16 at 21:59
  • You wouldn't use it with `fixed = TRUE` because we're using a regular expression here. – Hansi Nov 23 '16 at 16:20
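The point about `fixed` in the comments above can be sketched as follows. With `fixed = TRUE` the pattern is treated as a literal string, so the regex from the answer is never found; the `strsplit` line is an illustrative alternative of my own (not from the answer) showing where `fixed = TRUE` does fit, namely when the pattern really is a literal period.

```r
a <- c("NM_020506.1", "NM_010281.2")

# fixed = TRUE: the four characters \..* are searched for verbatim,
# nothing matches, and the input comes back unchanged.
literal <- sub("\\..*", "", a, fixed = TRUE)

# fixed = TRUE with a literal ".": split each ID on the period
# and keep the first piece.
parts <- strsplit(a, ".", fixed = TRUE)
first <- vapply(parts, `[`, character(1), 1)

print(literal)
print(first)
```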
17

We can pretend they are filenames and remove extensions:

tools::file_path_sans_ext(a)
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
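One difference from the regex approach is worth keeping in mind: `file_path_sans_ext` strips only the last extension, while `"\\..*"` cuts at the first period. A small sketch, where the second ID is a made-up input with two periods (not a real accession):

```r
# The second element is hypothetical, chosen only to show the difference.
ids <- c("NM_020506.1", "NM_020506.1.bak")

last_ext_only <- tools::file_path_sans_ext(ids)  # removes the final extension only
from_first_dot <- sub("\\..*", "", ids)          # removes everything from the first period

print(last_ext_only)
print(from_first_dot)
```

For accession numbers with a single period, as in the question, both give the same result.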
zx8754
12

You could do:

sub("\\.[0-9]+$", "", a)

or

library(stringr)
str_sub(a, start=1, end=-3)
johannes
  • Alternatives: `str_replace(a,"\\.[0-9]","")` and `str_replace(a,"\\..*","")` – Paolo May 17 '12 at 15:29
  • The `str_sub(a, start = 1, end = -3)` solution assumes that there are **only two characters** to remove (the "." and a single digit after it). For many gene ID systems, there could be multiple digits in the version (especially with probe IDs for instance). In this case, a more flexible solution would be `str_remove(a, pattern = "\\..*")`. In the code above, the pattern is to find the first period (using `"\\."`), then *any* character after that (`"."`) *any* number of times (`"*"`). – Gabriel J. Odom Aug 05 '21 at 20:04
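The multi-digit case from the last comment can be sketched in base R. Here `substr(x, 1, nchar(x) - 2)` stands in for `str_sub(x, start = 1, end = -3)` so the example needs no extra package, and the two-digit version number is a hypothetical input.

```r
# The second element has a made-up two-digit version.
ids <- c("NM_020506.1", "NM_020506.12")

# Fixed-width trimming: always drops exactly the last two characters.
fixed_width <- substr(ids, 1, nchar(ids) - 2)

# Pattern-based trimming: drops the period and whatever follows it.
by_pattern <- sub("\\..*", "", ids)

print(fixed_width)  # the second element keeps a trailing "."
print(by_pattern)
```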
9

If the strings were all of a fixed length, `substr` from base R could be used directly. Since they are not, we can get the position of the `.` with `regexpr` and pass that to `substr`:

substr(a, 1, regexpr("\\.", a)-1)
#[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"   
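One caveat with this approach: `regexpr` returns -1 when there is no period, and `substr` then returns an empty string. A minimal sketch of a guard, assuming some IDs may arrive without a version suffix (the second element below is such a hypothetical input):

```r
# The second ID has no version suffix (hypothetical input).
ids <- c("NM_020506.1", "NM_010281")

# as.integer() drops the match.length attributes that regexpr() attaches.
pos <- as.integer(regexpr("\\.", ids))

# Without a guard, the ID that has no period becomes "".
unguarded <- substr(ids, 1, pos - 1)

# With a guard, IDs without a period are returned unchanged.
guarded <- ifelse(pos > 0, substr(ids, 1, pos - 1), ids)

print(unguarded)
print(guarded)
```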
akrun
4

We can use a lookahead regex to extract the part of each string before the `.`:

library(stringr)

str_extract(a, ".*(?=\\.)")
[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"   
[5] "NM_011419"    "NM_053155"   
benson23
0

Another option is to use str_split from stringr:

library(stringr)
str_split(a, "\\.", simplify = TRUE)[, 1]
[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"   
user438383