I would like to expand expressions that are in this form:

a <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

I would like to convert the expression into the full sequence, like this:

b <- "AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT"

As you can see, the number after the closing bracket is the number of times the pattern between the brackets is repeated.

For the moment I use sub(".*[*(.*?) *].*", "\\1", seq) to select the characters between [] and replicate(i, "my_string") to repeat the sequence between [], but I cannot find how to make it work with my data.

I hope this is clear enough.

zx8754
yach
  • Could you please show your expected output? If the letters inside the `[]` need to be replicated, it looks different – akrun Feb 20 '18 at 09:59
  • Looks like it would have been much easier if you could get something like `a=[AGAT]5[GAT]1[AGAT]7[AGAC]6[AGAT]1` instead from the process that generates your expressions. This is a more consistent format. – AntoniosK Feb 20 '18 at 10:16
  • @AntoniosK I totally agree with you but biologists have their logic ... but your solution is what we must understand – yach Feb 20 '18 at 10:20

2 Answers

3

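We use gsub to insert a 1 before every [ that is not preceded by a number ('a1'), then extract the letters and numbers separately ('v1', 'v2'). For this input the first inserted 1 lands at the start of the string, so the counts are rotated by one position before doing the replication with strrep; the substrings are then pasted into a single string ('res').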

library(stringr)
a1 <- gsub("(?<![0-9])\\[", "1[", a, perl = TRUE)
v1 <- str_extract_all(a1, '[A-Z]+')[[1]]
v2 <- str_extract_all(a1, "[0-9]+")[[1]]
res <- paste(strrep(v1, as.numeric(c(tail(v2, -1), v2[1]))), collapse='')
res

-output

#[1] "AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT"

-checking with the 'b'

identical(res, b)
#[1] TRUE
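
To make the reordering in `c(tail(v2, -1), v2[1])` concrete, here is a small trace of the intermediate values for the sample input. It is only a sketch: it uses base R `regmatches()`/`gregexpr()` in place of `str_extract_all()` so that it runs without stringr, and the name `counts` is introduced here for illustration.

```r
a  <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

# Insert a 1 before every "[" that is not preceded by a digit
a1 <- gsub("(?<![0-9])\\[", "1[", a, perl = TRUE)
a1
# "1[AGAT]5GAT1[AGAT]7[AGAC]6AGAT"

# Runs of letters and runs of digits, in order of appearance
v1 <- regmatches(a1, gregexpr("[A-Z]+", a1))[[1]]
v2 <- regmatches(a1, gregexpr("[0-9]+", a1))[[1]]
v1  # "AGAT" "GAT" "AGAT" "AGAC" "AGAT"
v2  # "1" "5" "1" "7" "6"

# The leading 1 belongs to no motif, so it is moved to the end,
# where it serves as the count of the trailing unbracketed AGAT
counts <- as.numeric(c(tail(v2, -1), v2[1]))
counts  # 5 1 7 6 1
```

Note that the rotation relies on the expression starting with `[`. For an input that starts with an unbracketed run, such as `GAT[AGAT]5`, the counts would be paired with the wrong motifs.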

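A slightly more compact option is to change the first step so that a 1 is inserted after each unbracketed run of letters, which keeps the counts in the same order as the letters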

a1 <- gsub("(?<=[A-Z])(?=\\[)|(?<=[A-Z])$", "1", a, perl = TRUE)
v1 <- str_extract_all(a1, '[A-Z]+')[[1]]
v2 <- str_extract_all(a1, "[0-9]+")[[1]]
res1 <- paste(strrep(v1, as.numeric(v2)), collapse="")
identical(res1, b)
#[1] TRUE

data

a <- '[AGAT]5GAT[AGAT]7[AGAC]6AGAT'
b <- 'AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT'
Community
akrun
  • it does not give the correct output. GAT (second position) and AGAT (last position) are missing. – DJack Feb 20 '18 at 10:01
  • @DJack I am replicating the letters inside the `[]`. Which one do you want? Is it inside or outside? – akrun Feb 20 '18 at 10:02
  • 1
    Thank you for your help, however it is not the right solution because the output is not good. The numbers between brackets must be repeated the number of times indicated by the number after the brackets. When there is no brackets then there is no repetition. – yach Feb 20 '18 at 10:04
  • @yach It is confusing. For example `[AGAT]5` it seems like AGAT is replicated 5 times, then then next `[AGAT]7` also replicated 7 ? – akrun Feb 20 '18 at 10:05
  • here is a more readable output : AGATAGATAGATAGATAGAT GAT AGATAGATAGATAGATAGATAGATAGAT AGACAGACAGACAGACAGACAGAC AGAT – yach Feb 20 '18 at 10:06
  • @akrun it does not work with all my data. This is not your fault; it comes from my data, which sometimes vary. For example, there are times when we have brackets without a number after them. Thanks for your help – yach Feb 20 '18 at 10:31
2

Try this:

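a <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

# Split on "]" and then on "[" so that motifs and counts end up in separate pieces
parts <- unlist(strsplit(unlist(strsplit(a, "\\]")), "\\["))

# Leading digits of a piece are the repeat count of the previous motif; pieces without digits get 1
counts <- suppressWarnings(as.numeric(gsub("([0-9]+).*$", "\\1", parts)))
counts[is.na(counts)] <- 1
motifs <- gsub("[0-9]+", "", parts)

# Drop the empty first piece and shift the counts by one so each count lines up with its motif
out <- paste(rep(motifs[2:length(motifs)], counts[c(3:length(counts), 2)]), collapse = "")

b <- "AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT"

out == b
#[1] TRUE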

The output is correct, but I don't know whether it is a general solution for every kind of input data.
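
One way to depend less on the position of each piece, offered only as a sketch, is to tokenize the expression directly: each token is either a bracketed motif with an optional count or a bare run of letters. The helper name `expand_repeats` is hypothetical, and the sketch assumes that motifs contain only uppercase letters, that a bracket with no number after it means one copy, and that unbracketed runs are never followed by a count.

```r
expand_repeats <- function(x) {
  # Tokens: "[MOTIF]" followed by an optional count, or a bare run of letters
  tokens <- regmatches(x, gregexpr("\\[[A-Z]+\\][0-9]*|[A-Z]+", x))[[1]]

  # Keep only the letters of each token, and only its digits
  motifs <- gsub("[^A-Z]", "", tokens)
  counts <- gsub("[^0-9]", "", tokens)

  # Assumption: a token without a count is taken exactly once
  counts[counts == ""] <- "1"

  paste(strrep(motifs, as.integer(counts)), collapse = "")
}

a <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"
# Expected result built from the repeat counts given in the question
b <- paste0(strrep("AGAT", 5), "GAT", strrep("AGAT", 7), strrep("AGAC", 6), "AGAT")

identical(expand_repeats(a), b)

# Brackets without a number, and input starting with a bare run
expand_repeats("[AGAT][AGAC]2")
expand_repeats("GAT[AGAT]2")
```

Because each count is read from the same token as its motif, no rotation or index shifting is needed, and the result does not depend on whether the expression starts or ends with a bracket.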

Terru_theTerror