I have a data.frame called rbp that contains a single column, like the following:

 >rbp
          V1
    dd_smadV1_39992_0_1
    Protein: AGBT(Dm)
    Sequence Position
    234
    290
    567
    126
    Protein: ATF1(Dm)
    Sequence Position
    534
    890
    105
    34
    128
    301
    Protein: Pox(Dm)
    201
    875
    453
    *********************
    dd_smadv1_9_02
    Protein: foxc2(Mm)
    Sequence Position
    145
    987
    345
    907
    Protein: Lor(Hs)
    876
    512

I would like to discard the sequence positions and extract only the sequence names and their corresponding protein names, like the following:

dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)
dd_smadv1_9_02 foxc2(Mm);Lor(Hs)  

I tried the following code in R but it failed:

library(gsubfn)
Sub(rbp$V1,"Protein:(.*?) ")

Could anyone guide me, please?

Carol

1 Answer

Here's one way to do it:

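# Join all rows into one string, then split it into one block per sequence record
x <- strsplit(paste(rbp$V1, collapse = "\n"), "*********************", fixed = TRUE)[[1]]
# Locate every "Protein: <name>" line in each block and keep just the name
m <- gregexpr("Protein: (.*?)\n", x)
proteins <- lapply(regmatches(x, m), function(x) sub("Protein: (.*)\n", "\\1", x))
# The sequence id is the first run of letters, digits, or underscores followed by a newline
names <- sub(".*?([A-Za-z0-9_]+)\n.*", "\\1", x)
sprintf("%s %s", names, sapply(proteins, paste, collapse = ";"))
# [1] "dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)"
# [2] "dd_smadv1_9_02 foxc2(Mm);Lor(Hs)"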
lukeA
  • Thanks for your reply. It worked, but I would like to have the ids in one column and the corresponding protein names in another. Also, there are empty lines between the two entries. – Carol Mar 03 '15 at 14:19
  • You should be able to get two columns by using `read.table(text = sprintf("%s %s", names, sapply(proteins, paste, collapse = ";")))`. I can't reproduce the empty lines with the data you provided, but I'm sure you can exclude them easily. – lukeA Mar 03 '15 at 14:24
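
For the follow-up about two columns and empty lines, here is a minimal base R sketch of a line-by-line alternative. It is not the answer's method. It assumes that every record starts with its id line, that records are separated by a row made only of asterisks, and that the stray empty lines are blank rows in `rbp$V1`. The sample data, the helper `parse_block`, and the output column names `id` and `proteins` are made up for illustration.

```r
# Sample input mirroring the question's layout. The blank rows are an
# assumption based on the comment about empty lines.
rbp <- data.frame(
  V1 = c("dd_smadV1_39992_0_1", "Protein: AGBT(Dm)", "Sequence Position",
         "234", "", "Protein: ATF1(Dm)", "Sequence Position", "534",
         "Protein: Pox(Dm)", "201",
         "*********************", "",
         "dd_smadv1_9_02", "Protein: foxc2(Mm)", "Sequence Position", "145",
         "Protein: Lor(Hs)", "876"),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace and drop empty rows before any parsing.
lines <- trimws(as.character(rbp$V1))
lines <- lines[nzchar(lines)]

# Rows made only of asterisks separate records; give each record a number.
is_sep <- grepl("^\\*+$", lines)
record <- cumsum(is_sep)[!is_sep]
lines <- lines[!is_sep]

# One character vector per record.
blocks <- split(lines, record)

# Hypothetical helper: the first line of a block is taken as the id,
# and every "Protein: ..." line contributes one protein name.
parse_block <- function(b) {
  prot <- sub("^Protein:[[:space:]]*", "", b[grepl("^Protein:", b)])
  data.frame(id = b[1], proteins = paste(prot, collapse = ";"),
             stringsAsFactors = FALSE)
}

# Stack the one-row data frames into a two-column result.
result <- do.call(rbind, lapply(blocks, parse_block))
rownames(result) <- NULL
result
```

Working on rows rather than on one pasted string means blank rows can be filtered out up front, and the result is already a data.frame, so no `read.table` round trip is needed.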