I have FASTA files from an in vitro SELEX experiment. In theory, all reads should start with the same 6 bases (core sequence: GCTGCT) and have the same length, 27 nt. In reality, the core sequence in some reads starts later, as much as 10 bases into the read, and is then followed by the remaining 21 nt. I would like to extract the reads that contain the core sequence, regardless of where in the read it starts, and then crop them to the 27 nt that begin at the core sequence.
Example (the region I would like to extract is the 27 nt starting at the first GCTGCT in each read):
read1 GCTGCTTTTTCGCTTTCCTTGCGGCCAAAA
read2 GACGTGTGCTGCTGCTAATTTGCTTTCCTTGTCCATGAA
Here read1 starts with the core sequence and only needs to be cropped to 27 nt; this part is easy to do. The problem is read2, where the core sequence starts later, so I cannot simply keep the first 27 nt of the read. Instead I need the 27 nt that begin at the start of the core sequence. I would like the output to be in FASTA format.
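To make the intended behaviour concrete, here is a minimal Python sketch of what I am after. It is only an illustration under a few assumptions of mine: the first occurrence of GCTGCT is treated as the start of the core sequence, reads with fewer than 27 nt from that point onward are dropped, and the function names (`crop_from_core`, `read_fasta`, `filter_fasta`) are made up for this example.

```python
CORE = "GCTGCT"   # core sequence from the experiment
TARGET_LEN = 27   # desired read length, counted from the start of the core


def crop_from_core(seq, core=CORE, length=TARGET_LEN):
    """Return `length` nt starting at the first occurrence of `core`.

    Returns None if the core is absent or too few bases follow it
    (assumption: such reads should be discarded).
    """
    start = seq.upper().find(core)
    if start == -1:
        return None
    cropped = seq[start:start + length]
    return cropped if len(cropped) == length else None


def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_handle, out_handle):
    """Write the cropped reads in FASTA format, skipping reads that fail."""
    for header, seq in read_fasta(in_handle):
        cropped = crop_from_core(seq)
        if cropped is not None:
            out_handle.write(f">{header}\n{cropped}\n")
```

With two open file handles (for example a hypothetical reads.fasta for input and cropped.fasta for output), calling `filter_fasta(fin, fout)` would write only the reads that contain the core sequence, each cropped to 27 nt. Note that in read2 the core is repeated (GCTGCTGCT), so taking the first occurrence is a choice that may or may not be the right one for the data.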
Does anyone know of a tool that can do this, or have other suggestions?