I have a data frame that has a column containing the chromosome details (1 to 22). I would like to create another column with only Chr numbers
Asked
Active
Viewed 233 times
3 Answers
2
Without a reproducible example it will be hard to answer. Using stringr
package and regex
you may achieve what you are searching for but you need to know all possibilities. Maybe if there is only underscore between what you want and annoying information, you can solve your problem using str_split
and "_" as pattern parameter.
Please refer to https://stackoverflow.com/help/how-to-ask
library(stringr)
df <- data.frame(chromosome = c("chr6_GL000253v2_alt", "chr6_GL000254v2_alt",
"chr6_GL000255v2_alt", "chr6_GL000256v2_alt", "chr4", "chr11",
"chr8", "chr12", "chr2", "chr12", "chr4", "chr6", "chr15", "chr4",
"chr2"))
df$chromosome_fixed=str_split(df$chromosome,"_",simplify = T)[,1]

Bast_B
- 143
- 6
-
While I agree, I think this should be a comment instead of an answer. You could actually create a small example and show your (working) solution for this problem. – Martin Gal Oct 12 '21 at 14:37
-
1
-
Bast & Martin, thank you very much. This helped me for sure. The new column has Chr in that like Chr1, Chr14 etc --> I just want to get new column with numbers 1,14,2,16 etc – Mahan Oct 13 '21 at 09:54
-
You can remove "chr" by doing a `substring` substring(df$chromosome_fixed ,3) – Bast_B Oct 13 '21 at 14:34
2
Please find below a solution with the package data.table
:
REPREX
- Code
library(data.table)
library(stringr)
DT[, Chr_ID := lapply(.SD, str_extract,"(?<=^chr)\\d+"), .SDcols = "chromosome"]
- Output
DT
#> chromosome Chr_ID
#> 1: chr6_GL000253v2_alt 6
#> 2: chr6_GL000254v2_alt 6
#> 3: chr6_GL000255v2_alt 6
#> 4: chr6_GL000256v2_alt 6
#> 5: chr4 4
#> 6: chr11 11
#> 7: chr8 8
#> 8: chr12 12
#> 9: chr2 2
#> 10: chr12 12
#> 11: chr4 4
#> 12: chr6 6
#> 13: chr15 15
#> 14: chr4 4
#> 15: chr2 2
- Your data
DT <- data.table(chromosome = c("chr6_GL000253v2_alt", "chr6_GL000254v2_alt",
"chr6_GL000255v2_alt", "chr6_GL000256v2_alt", "chr4", "chr11",
"chr8", "chr12", "chr2", "chr12", "chr4", "chr6", "chr15", "chr4",
"chr2"))
DT
#> chromosome
#> 1: chr6_GL000253v2_alt
#> 2: chr6_GL000254v2_alt
#> 3: chr6_GL000255v2_alt
#> 4: chr6_GL000256v2_alt
#> 5: chr4
#> 6: chr11
#> 7: chr8
#> 8: chr12
#> 9: chr2
#> 10: chr12
#> 11: chr4
#> 12: chr6
#> 13: chr15
#> 14: chr4
#> 15: chr2
Created on 2021-10-12 by the reprex package (v2.0.1)

lovalery
- 4,524
- 3
- 14
- 28
-
You are just extracting the 4th character, so `chr11` is transformed into `1`. I doubt this is a correct solution. – Martin Gal Oct 12 '21 at 14:33
-
1Oops! You are of course right @Martin Gal. Sorry about that. I just edited my answer based on your previous comment. Thanks again for your feedback – lovalery Oct 12 '21 at 14:55
-1
Since you didn't share the data. I've created similar column and extracted the numbers to a new column called Number:
#Populate a dummy table
df = pd.DataFrame(data=['chr6_GL','chr6_GL00','chr4','chr11','chr8','chr12'], columns=['Data'])
#Extract the numbers using regex and assign it to a new column called 'Number'
df['Numbers']=df['Data'].str.extract(r'chr([0-9]*)')

pyzer
- 122
- 7