
I have a data frame containing a number of Ensembl gene annotations. It looks like this:

        geneID
1  ENSG00000000005.5
2  ENSG00000001561.6
3 ENSG00000002726.18
4 ENSG00000005302.16
5 ENSG00000005379.14
6  ENSG00000006116.3

I would like to delete the "." and the numbers that follow it at the end of every ID. In total I have 11224 rows. I tried `gsub(".","",colnames(dataframe))`, but this is not helping.

Any suggestions? Thank you in advance.

Biocrazy
  • Is there a case where the part after the dot is non-numeric and you would want to leave it intact, i.e. `ENSG0000000005.TR` to remain the same, or `ENSG000000005.5E` to be left as `ENSG000000005.E`? If not, and you always want to remove everything after the dot, then this is a duplicate of [this question](https://stackoverflow.com/questions/10617702/remove-part-of-string-after) – Sotos Jul 31 '17 at 14:48

2 Answers


If we need to keep the . at the end, capture the characters up to and including the . (since . is a metacharacter that matches any character, escape it with \\), followed by one or more digits (\\d+) until the end of the string, and replace the match with the backreference (\\1) to the captured group:

df1$geneID <- sub("^(.*\\.)\\d+$", "\\1", df1$geneID)
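To make the effect of that pattern concrete, here is a minimal sketch on two of the IDs from the question. The vector name `ids` and the result name `kept_dot` are only illustrative, not part of the original code.

```r
# Two sample IDs copied from the question (illustrative vector, not the full data)
ids <- c("ENSG00000000005.5", "ENSG00000002726.18")

# Capture everything up to and including the last dot, then drop the
# trailing digits by replacing the whole match with the captured group
kept_dot <- sub("^(.*\\.)\\d+$", "\\1", ids)
print(kept_dot)
# [1] "ENSG00000000005." "ENSG00000002726."
```

Note that the dot itself stays in the result, which is usually not what is wanted for Ensembl IDs.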

If the intention is to remove the . together with the numbers after it, match the dot followed by one or more digits and replace the match with an empty string (""):

df1$geneID <- sub("\\.\\d+", "", df1$geneID)
df1$geneID
#[1] "ENSG00000000005" "ENSG00000001561" "ENSG00000002726" "ENSG00000005302"
#[5] "ENSG00000005379" "ENSG00000006116"
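As for why the attempt in the question did not work, the sketch below is one way to see it. It rebuilds a small `df1` from two of the posted IDs (an assumption, since the real data frame has 11224 rows) and anchors the pattern with `$`, which is my addition rather than part of the code above.

```r
# Small stand-in for the data frame in the question (assumed structure)
df1 <- data.frame(geneID = c("ENSG00000000005.5", "ENSG00000001561.6"),
                  stringsAsFactors = FALSE)

# The original attempt has two problems: it works on the column names
# rather than the column values, and an unescaped "." matches every character
attempt <- gsub(".", "", colnames(df1))
print(attempt)
# [1] ""

# Escaping the dot and applying the pattern to the column itself;
# the trailing "$" (my addition) restricts the match to the end of the ID
fixed_ids <- sub("\\.\\d+$", "", df1$geneID)
print(fixed_ids)
# [1] "ENSG00000000005" "ENSG00000001561"
```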
akrun

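You can use the following code to remove everything from the first '.' onward, whether it is numeric or not. `gsub` returns a new vector, so assign the result back to the column if you want to keep it: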

gsub("\\..*", "", df$geneID)
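The difference between this pattern and a digits-only pattern shows up on the kind of ID raised in the comment on the question. The following is a rough comparison; the second ID is the hypothetical `ENSG0000000005.TR` from that comment, not real data, and the variable names are illustrative.

```r
# One ID from the question plus the hypothetical non-numeric suffix
# mentioned in the comment
ids <- c("ENSG00000000005.5", "ENSG0000000005.TR")

# "\\..*" removes the first dot and everything after it, whatever follows
strip_all <- sub("\\..*", "", ids)
print(strip_all)
# [1] "ENSG00000000005" "ENSG0000000005"

# "\\.\\d+$" removes the suffix only when it consists of digits
strip_digits <- sub("\\.\\d+$", "", ids)
print(strip_digits)
# [1] "ENSG00000000005"   "ENSG0000000005.TR"
```

If every ID ends in a numeric version suffix, as in the posted sample, both patterns give the same result.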
Sagar