Remove part of string after "."

Question

I am working with NCBI Reference Sequence accession numbers like variable a:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")

To get information from the biomaRt package, I need to remove the version suffix (.1, .2, etc.) after the accession numbers. I normally do this with this code:

b <- sub("..*", "", a)

# [1] "" "" "" "" "" ""

But as you can see, this isn't the correct way for this variable. Can anyone help me with this?

score 158 · Accepted Answer · answered May 16 '12 at 14:43

You just need to escape the period:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")

gsub("\\..*","",a)
[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
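
As a minimal sketch of why the original pattern returned empty strings: an unescaped `.` matches any character, so `"..*"` matches the whole string starting at its first character. The short vector `ids` below is only an illustrative sample, and the character-class form `[.]` is an alternative spelling that is not part of the answer above.

```r
ids <- c("NM_020506.1", "NM_001030297.2")

# Unescaped: "." matches any character, so the entire string is replaced
unescaped <- sub("..*", "", ids)

# Two ways to match a literal period and everything after it
escaped   <- sub("\\..*", "", ids)   # backslash escape
bracketed <- sub("[.].*", "", ids)   # character class

print(unescaped)
print(escaped)
print(bracketed)
```

Because each ID contains at most one match, `sub` and `gsub` give the same result here.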

Hansi

Clarification: With functions in base R (i.e. without other packages like `stringr`), the options are as posted: `b1 <- gsub("\\..*", "", a, fixed = FALSE)` and `b2 <- sub("\\..*", "", a, fixed = FALSE)`. In some cases you may need to change the `fixed` argument, but here it *must* be `FALSE` (the default); otherwise the pattern is treated as a literal string and nothing matches. You also need the double backslash `\\`; a single `\.` is not a valid escape in an R string and raises an error. – David C. Nov 22 '16 at 21:59
You wouldn't set `fixed` to `TRUE` because we're using a regular expression here. – Hansi Nov 23 '16 at 16:20

score 17 · Answer 2 · answered Jun 14 '17 at 14:07

We can pretend they are filenames and remove extensions:

tools::file_path_sans_ext(a)
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"

zx8754

score 12 · Answer 3 · answered May 16 '12 at 11:44

You could do:

sub("\\.[0-9]+$", "", a)

or

library(stringr)
str_sub(a, start=1, end=-3)

johannes

Alternatives: `str_replace(a,"\\.[0-9]","")` and `str_replace(a,"\\..*","")` – Paolo May 17 '12 at 15:29
The `str_sub(a, start = 1, end = -3)` solution assumes that there are **only two characters** to remove (the "." and a single digit after it). For many gene ID systems, there could be multiple digits in the version (especially with probe IDs for instance). In this case, a more flexible solution would be `str_remove(a, pattern = "\\..*")`. In the code above, the pattern is to find the first period (using `"\\."`), then *any* character after that (`"."`) *any* number of times (`"*"`). – Gabriel J. Odom Aug 05 '21 at 20:04
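
The fixed-width assumption described in the comment above can be illustrated with a small sketch. It uses base R's `substr` as a stand-in for `str_sub(a, start = 1, end = -3)` so that no package is needed, and the second ID is a made-up example with a two-digit version.

```r
ids <- c("NM_011419.3", "NM_000014.12")  # second ID is hypothetical

# Dropping exactly two trailing characters leaves a stray "." on the second ID
fixed_width <- substr(ids, 1, nchar(ids) - 2)

# Removing everything from the first period onward handles both
from_period <- sub("\\..*", "", ids)

print(fixed_width)
print(from_period)
```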

score 9 · Answer 4 · answered Apr 24 '19 at 13:10

If the part to keep had a fixed length, substr from base R could be used directly. Here the length varies, so we can get the position of the first . with regexpr and pass it to substr:

substr(a, 1, regexpr("\\.", a)-1)
#[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
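
One caveat worth sketching: `regexpr` returns -1 when there is no match, so an ID without a period would come back as an empty string with the approach above. The guard below is only one possible way to handle that case; the unversioned ID in `ids` is an illustrative assumption, not part of the original data.

```r
ids <- c("NM_020506.1", "NM_020519")  # second ID has no version suffix (illustrative)

# Position of the first ".", or -1 if there is none
pos <- as.integer(regexpr("\\.", ids))

# Without a guard, the ID lacking a period becomes ""
naive <- substr(ids, 1, pos - 1)

# Keep the ID unchanged when no period was found
safe <- ifelse(pos > 0, substr(ids, 1, pos - 1), ids)

print(naive)
print(safe)
```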

score 4 · Answer 5 · answered May 13 '22 at 14:22

We can use a lookahead regex to extract the part of each string before the period:

library(stringr)

str_extract(a, ".*(?=\\.)")
[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"   
[5] "NM_011419"    "NM_053155"

edited Jul 19 '22 at 20:06

answered May 13 '22 at 14:22

benson23

score 0 · Answer 6 · answered May 13 '22 at 14:29

Another option is to use str_split from stringr:

library(stringr)
str_split(a, "\\.", simplify = TRUE)[, 1]

[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
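
For comparison, a base R sketch of the same split-and-take-first idea, assuming stringr is not available. Here `fixed = TRUE` makes `strsplit` treat `"."` as a literal period, so no escaping is needed; `ids` is an illustrative sample.

```r
ids <- c("NM_020506.1", "NM_001030297.2")

# Split each ID on a literal period, then keep the first piece of each
pieces <- strsplit(ids, ".", fixed = TRUE)
first_piece <- vapply(pieces, `[`, character(1), 1)

print(first_piece)
```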

user438383
