I'm new to R. I have this data frame and I'm trying to delete all the information from this column except the gene symbols, which always come second in the string. [image of the data frame] Best regards!

I tried `gsub()`, but it only deleted the specific element I matched. I'm wondering whether I can use it to keep only the gene symbol (which always comes second in the string) and delete everything else.

Yamen Wm
  • Please do share your `gsub()` attempt, if you were able to delete that specific (correct, I assume) element, you just need to fiddle with groups a bit to only keep that. – margusl Feb 06 '23 at 08:24
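
The capture-group idea hinted at in the comment above could look roughly like this in base R. This is only a sketch: the example strings are made up, and it assumes the fields are separated by exactly " // " as in the question's screenshot.

```r
# Hypothetical example strings in the "ids // symbol // description" shape
x <- c("NM_001 // TP53 // tumor protein p53",
       "NM_002 // BRCA1 // BRCA1 DNA repair associated")

# Keep only the text captured between the first and second " // ".
# perl = TRUE enables the lazy quantifier .*?
# Note: sub() returns the input unchanged when the pattern does not match.
gene_sub <- sub("^.*? // (.*?) // .*$", "\\1", x, perl = TRUE)

# Alternative without a regex: split on the literal separator, take field 2
gene_split <- vapply(strsplit(x, " // ", fixed = TRUE),
                     function(parts) parts[2], character(1))
```

Both versions pick the second " // "-delimited field, so they do not care whether the other fields contain spaces.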

1 Answer

If your data is consistently in the format shown in the image (where the gene ID is always the third "word" of the string), then the word() function from the stringr package can extract the data you want.

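library(stringr)

# Example data in the same shape as the question: "id numbers // gene ID // other stuff"
dat = data.frame(gene_assignment = rep('idnumbers // geneID // Other stuff', 10))

# word() splits on spaces, so the gene ID is the third word: "idnumbers", "//", "geneID", ...
dat$geneID = word(dat$gene_assignment, 3)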

Note that this makes the following assumptions:

  1. Your data is always in the format where there are some id numbers, followed by " // ", followed by the gene ID, followed by a space, and then anything else
  2. Neither the ID numbers at the front nor the gene ID ever contains a space

These assumptions are necessary because word() uses spaces to determine when each word starts and ends.
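
To illustrate why the assumptions matter, here is a small sketch with made-up strings (not taken from the question's data). It also shows one possible way to relax assumption 2: `word()` has a `sep` argument, so splitting on the literal " // " instead of a space selects the second field directly. This still assumes the separator is always exactly " // ".

```r
library(stringr)

# Hypothetical rows: the second one has a space inside the ID part
x <- c("NM_001 // TP53 // tumor protein p53",
       "NM 002 // BRCA1 // BRCA1 DNA repair associated")

# Splitting on spaces: correct for row 1, but row 2 is shifted by the
# extra space, so its third "word" is no longer the gene symbol
by_space <- word(x, 3)

# Splitting on the literal " // " separator: the symbol is field 2 in both rows
by_field <- word(x, 2, sep = fixed(" // "))
```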

NickCHK
  • If those assumptions *are* violated, by the way, then you'll have to combine a few stringr functions: (1) use str_locate to find the position of the first // in each row, (2) use str_sub to keep everything right after that position, overwriting the original string, (3) repeat step 1 on the new string, (4) use str_sub to cut off everything from the // onwards, (5) use str_trim to get rid of whitespace – NickCHK Feb 06 '23 at 08:11
  • Thanks! This code worked perfectly: dat$geneID = word(dat$gene_assignment, 3) – Yamen Wm Feb 06 '23 at 08:28
  • Great! Feel free to mark this answer as Correct so that anyone else coming along will know it works. – NickCHK Feb 06 '23 at 20:14
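
The five-step fallback described in the first comment above can be sketched as follows. The example strings are invented for illustration, and the sketch assumes every row contains at least two `//` separators; rows that do not would end up as `NA`.

```r
library(stringr)

# Hypothetical rows where the ID part contains spaces, so word(x, 3) would fail
x <- c("NM 001 // TP53 // tumor protein p53",
       "ID 42 // BRCA1 // BRCA1 DNA repair associated")

# (1) locate the first "//" in each row (matrix with "start" and "end" columns)
first <- str_locate(x, fixed("//"))

# (2) keep everything right after that first "//"
rest <- str_sub(x, first[, "end"] + 1)

# (3) locate the first "//" in the shortened string
second <- str_locate(rest, fixed("//"))

# (4) cut off everything from that "//" onwards
gene <- str_sub(rest, 1, second[, "start"] - 1)

# (5) strip the surrounding whitespace
gene <- str_trim(gene)
```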