-2

I have a data frame with a column containing chromosome details (1 to 22). I would like to create another column containing only the chromosome numbers.

Mahan

3 Answers

2

Without a reproducible example it is hard to answer. With the stringr package and a regular expression you can get what you are looking for, but you need to know all the formats that can occur. If only an underscore separates the part you want from the unwanted suffix, you can solve the problem with str_split, using "_" as the pattern argument and keeping the first piece. Please refer to https://stackoverflow.com/help/how-to-ask

library(stringr)
df <- data.frame(chromosome = c("chr6_GL000253v2_alt", "chr6_GL000254v2_alt",
                                "chr6_GL000255v2_alt", "chr6_GL000256v2_alt", "chr4", "chr11",
                                "chr8", "chr12", "chr2", "chr12", "chr4", "chr6", "chr15", "chr4",
                                "chr2"))
# Split at "_" and keep the first piece, e.g. "chr6_GL000253v2_alt" -> "chr6"
df$chromosome_fixed <- str_split(df$chromosome, "_", simplify = TRUE)[, 1]
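
If only the number is wanted, the same two steps (cut at the underscore, then drop the "chr" prefix) can also be sketched in base R. This is a minimal sketch, not part of the original answer: the shortened example data and the column name chr_number are assumptions made for illustration.

```r
# Hypothetical example data, shortened from the data frame above
df <- data.frame(chromosome = c("chr6_GL000253v2_alt", "chr4", "chr11", "chr12"))

# Step 1: drop everything from the first underscore onwards, e.g. "chr6"
prefix <- sub("_.*$", "", df$chromosome)

# Step 2: drop the leading "chr" and convert the remainder to integer
# (chr_number is a hypothetical column name)
df$chr_number <- as.integer(sub("^chr", "", prefix))

print(df$chr_number)
```

Converting to integer is optional; it is only useful if the column should sort numerically (2 before 11) rather than as text.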
Bast_B
  • While I agree, I think this should be a comment instead of an answer. You could actually create a small example and show your (working) solution for this problem. – Martin Gal Oct 12 '21 at 14:37
  • Thanks Martin, you are right, I edited my answer accordingly – Bast_B Oct 12 '21 at 14:41
  • Bast & Martin, thank you very much. This helped me for sure. The new column still contains "chr" (chr1, chr14, etc.), but I just want a new column with the numbers only: 1, 14, 2, 16, etc. – Mahan Oct 13 '21 at 09:54
  • You can remove "chr" with `substring`, starting at the fourth character: `substring(df$chromosome_fixed, 4)` – Bast_B Oct 13 '21 at 14:34
2

Please find below a solution with the package data.table:

REPREX

  • Code
library(data.table)
library(stringr)

DT[, Chr_ID := lapply(.SD, str_extract,"(?<=^chr)\\d+"), .SDcols = "chromosome"]
  • Output
DT
#>              chromosome Chr_ID
#>  1: chr6_GL000253v2_alt      6
#>  2: chr6_GL000254v2_alt      6
#>  3: chr6_GL000255v2_alt      6
#>  4: chr6_GL000256v2_alt      6
#>  5:                chr4      4
#>  6:               chr11     11
#>  7:                chr8      8
#>  8:               chr12     12
#>  9:                chr2      2
#> 10:               chr12     12
#> 11:                chr4      4
#> 12:                chr6      6
#> 13:               chr15     15
#> 14:                chr4      4
#> 15:                chr2      2
  • Your data
DT <- data.table(chromosome = c("chr6_GL000253v2_alt", "chr6_GL000254v2_alt",
                 "chr6_GL000255v2_alt", "chr6_GL000256v2_alt", "chr4", "chr11",
                 "chr8", "chr12", "chr2", "chr12", "chr4", "chr6", "chr15", "chr4",
                 "chr2"))
DT
#>              chromosome
#>  1: chr6_GL000253v2_alt
#>  2: chr6_GL000254v2_alt
#>  3: chr6_GL000255v2_alt
#>  4: chr6_GL000256v2_alt
#>  5:                chr4
#>  6:               chr11
#>  7:                chr8
#>  8:               chr12
#>  9:                chr2
#> 10:               chr12
#> 11:                chr4
#> 12:                chr6
#> 13:               chr15
#> 14:                chr4
#> 15:                chr2

Created on 2021-10-12 by the reprex package (v2.0.1)
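
For readers who prefer to avoid extra packages, the same lookbehind pattern can probably be used in base R as well. The following is a sketch under assumptions, not part of the original answer: the vector chromosome, the result chr_id, and the non-matching entry "scaffold_12" are made up for illustration.

```r
# Hypothetical input; "scaffold_12" is added to show the no-match case
chromosome <- c("chr6_GL000253v2_alt", "chr4", "chr11", "scaffold_12")

# perl = TRUE enables the lookbehind (?<=^chr): match the digits that
# directly follow a leading "chr", without including "chr" in the match
m <- regexpr("(?<=^chr)\\d+", chromosome, perl = TRUE)

# regexpr() returns -1 where nothing matched; start from NA so those
# entries stay NA, which mirrors what str_extract() returns
matched <- as.vector(m) > 0
chr_id <- rep(NA_character_, length(chromosome))
chr_id[matched] <- regmatches(chromosome, m)

print(chr_id)
```

regmatches() only returns the elements that matched, which is why the result is filled in through the logical index rather than assigned directly.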

lovalery
  • You are just extracting the 4th character, so `chr11` is transformed into `1`. I doubt this is a correct solution. – Martin Gal Oct 12 '21 at 14:33
  • Oops! You are of course right @Martin Gal. Sorry about that. I just edited my answer based on your previous comment. Thanks again for your feedback – lovalery Oct 12 '21 at 14:55
-1

Since you didn't share the data, I've created a similar column and extracted the numbers into a new column called Numbers:

import pandas as pd

# Populate a dummy table
df = pd.DataFrame(data=['chr6_GL','chr6_GL00','chr4','chr11','chr8','chr12'], columns=['Data'])
# Extract the digits after 'chr' with a regex and assign them to a new column called 'Numbers'
df['Numbers'] = df['Data'].str.extract(r'chr([0-9]+)', expand=False)


pyzer