
I have FASTA files from an in vitro SELEX experiment. In theory, all reads should start with the same 6 bases (core sequence: GCTGCT) and have the same length, 27 nt. In practice, in some reads the core sequence starts as much as 10 bases into the read and is then followed by the remaining 21 nt. I would like to extract the reads that contain the core sequence, regardless of where in the read it starts, and then crop them to 27 nt.

Example (the region in bold is the region I would like to extract):

read1 GCTGCTTTTTCGCTTTCCTTGCGGCCAAAA

read2 GACGTGTGCTGCTGCTAATTTGCTTTCCTTGTCCATGAA

Here read1 starts with the core sequence and only needs to be cropped to 27 nt, which is easy to do. The problem is read2, where the core sequence starts later: I cannot simply crop the read to its first 27 nt, but need the 27 nt that begin at the start of the core sequence. I would like the output to be in FASTA format.

Does anyone know of a tool that can do this, or have other suggestions?
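To make the intended cropping concrete, here is a minimal Python sketch applied to the two example reads. The function name `crop_from_core` is made up for illustration. Returning `None` for reads that lack the core sequence, or that have fewer than 27 nt left after it, is an assumption about how such reads should be handled.

```python
CORE = "GCTGCT"   # core sequence from the question
TARGET_LEN = 27   # 6 nt core + 21 nt that follow it


def crop_from_core(seq, core=CORE, length=TARGET_LEN):
    """Return `length` nt starting at the first (5'-most) core occurrence.

    Returns None if the core is absent or too few bases follow it
    (an assumption; the question does not say what to do with such reads).
    """
    start = seq.find(core)  # leftmost match only; later matches are ignored
    if start == -1:
        return None
    cropped = seq[start:start + length]
    return cropped if len(cropped) == length else None


reads = {
    "read1": "GCTGCTTTTTCGCTTTCCTTGCGGCCAAAA",
    "read2": "GACGTGTGCTGCTGCTAATTTGCTTTCCTTGTCCATGAA",
}

for name, seq in reads.items():
    print(name, crop_from_core(seq))
# read1 GCTGCTTTTTCGCTTTCCTTGCGGCCA
# read2 GCTGCTGCTAATTTGCTTTCCTTGTCC
```

For read2 the crop starts at the first GCTGCT, 7 bases into the read, so the overlapping GCTGCTGCT repeat is kept from its leftmost position.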

Fluorine
  • I'm not familiar with these files, but I found an R package that reads and writes them: https://www.rdocumentation.org/packages/seqinr/versions/4.2-5/topics/read.fasta. String extraction in R can be done with the `stringr` package: first define your pattern, `patt <- "GCTGCT(.....................)"`, then use `str_extract(data$sequence_col, patt)`. – Pake Mar 23 '21 at 16:19
  • I made a sample data frame from your example to experiment with, and noticed that read2 contains GCTGCTGCT, with the last two repetitions of "GCT" beginning your highlighted sequence. Is such a case possible, and if so, what would you want to happen? Also, some sample data would be helpful if you can provide it. – Pake Mar 23 '21 at 16:22
  • I would like the first occurrence of GCTGCT (from the left, i.e. the 5' end) to count as the start of the read, extract GCTGCT + 21 nt, and disregard any other occurrence of GCTGCT later in the read. Your solution works, in the sense that I do get the correct sequences back. The only problem is that I have now lost the header structure of the FASTA file. Ideally the output would be in FASTA format, or at least retain the header information. – Fluorine Mar 24 '21 at 09:42
  • Happy to take a stab, but I would need an example file to play with. Is the sequence you're extracting replacing an existing column? If so, `data$existing_column<-str_extract(data$sequence_col,patt)` – Pake Mar 24 '21 at 14:44
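
One way to keep the headers is to carry each header alongside its sequence while trimming, then write both back out. Below is a rough standard-library Python sketch, not a tested tool. The helper names `read_fasta` and `trim_fasta` are hypothetical. Putting the example reads under `>read1` / `>read2` headers is an assumption about the file layout. Silently skipping reads without the core sequence, or with fewer than 27 nt after it, is also an assumption.

```python
import io

CORE = "GCTGCT"   # core sequence from the question
TARGET_LEN = 27   # 6 nt core + 21 nt


def read_fasta(handle):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_fasta(in_handle, out_handle, core=CORE, length=TARGET_LEN):
    """Write each read cropped to `length` nt from its first core occurrence.

    Headers are copied unchanged. Reads with no core, or with too few
    bases after it, are skipped (assumption). Returns the number kept.
    """
    kept = 0
    for header, seq in read_fasta(in_handle):
        start = seq.find(core)  # first (5'-most) occurrence only
        if start == -1 or len(seq) - start < length:
            continue
        out_handle.write(f">{header}\n{seq[start:start + length]}\n")
        kept += 1
    return kept


# Demo on the two reads from the question, with assumed headers.
demo_in = io.StringIO(
    ">read1\nGCTGCTTTTTCGCTTTCCTTGCGGCCAAAA\n"
    ">read2\nGACGTGTGCTGCTGCTAATTTGCTTTCCTTGTCCATGAA\n"
)
demo_out = io.StringIO()
trim_fasta(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

For real data, the two `StringIO` objects would be replaced by files opened with `open(...)`, one for reading and one for writing. The search is case-sensitive, so lowercase reads would need to be upper-cased first.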
