First of all, sorry about the length of the title; I couldn't find a better way to explain what I want this bash script to do.
I have a very large file (multifasta) that looks like this:
>NAME1
GATATATAGATTAGATTTAGAGAGAGGAGCTATTCATCAGAGCTATCATCAGCTACAGCA
>NAME2
GCGCTAGAGAGCTAGCTACGACTAGCACTAGAGGATACATCATGGGTCATCAGCAGTCAGCATCAC
>NAME3
GCATCAGCATGATAGATCTCATGACTAGATAGAACTATCAT
and so on.
I also have two patterns:
'GATA' and 'TCAT'
I already know that both patterns exist in every line that doesn't begin with '>', sometimes more than once. My objective is to print each '>' line and then, for the sequence line that follows it, print the distance (the number of characters between the end of 'GATA' and the start of 'TCAT') for every combination of the two patterns, like this:
>NAME1
29 #distance between the only 'GATA' and the first 'TCAT'
41 #distance between the only 'GATA' and the second 'TCAT'
>NAME2
2 #distance between the only 'GATA' and the first 'TCAT'
9 #distance between the only 'GATA' and the second 'TCAT'
>NAME3
4 #distance between the first 'GATA' and the first 'TCAT'
23 #distance between the first 'GATA' and the second 'TCAT'
6 #distance between the second 'GATA' and the second 'TCAT'
In the third block, there is no distance between the second 'GATA' and the first 'TCAT' because that 'TCAT' appears before the second 'GATA'; only pairs where 'GATA' comes first are counted.
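To make the distance definition above concrete, here is a minimal pure-bash sketch. The function names `offsets` and `distances` are hypothetical helpers introduced only for illustration, and the sketch assumes "distance" means the number of characters strictly between the end of the first pattern and the start of the second, which matches the numbers in the expected output. Scanning character by character in bash is likely to be slow on a very large file, so treat this as a statement of the logic rather than a tuned solution.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the 0-based start offset of every
# occurrence of pattern $2 inside sequence $1 (literal match).
offsets() {
    local seq=$1 pat=$2 i
    for (( i = 0; i + ${#pat} <= ${#seq}; i++ )); do
        [[ ${seq:i:${#pat}} == "$pat" ]] && echo "$i"
    done
    return 0
}

# Hypothetical helper: for every pair where the first pattern ends
# before the second pattern starts, print the number of characters
# between them.
distances() {
    local seq=$1 first=$2 second=$3 a b
    for a in $(offsets "$seq" "$first"); do
        for b in $(offsets "$seq" "$second"); do
            (( b >= a + ${#first} )) && echo $(( b - a - ${#first} ))
        done
    done
    return 0
}

# The NAME3 sequence from the example above.
distances "GCATCAGCATGATAGATCTCATGACTAGATAGAACTATCAT" GATA TCAT
```

For the NAME3 sequence this prints 4, 23 and 6, one per line; the pair made of the second 'GATA' and the first 'TCAT' is skipped because that 'TCAT' starts before the 'GATA' ends.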
I tried the following code:
while IFS= read -r line;
do
    echo $line;
    if [[ "$line" == ">"* ]];
    then
        echo $line;
    else
        count=$(sed -n '/GATA/,/TCAT/p' | wc -c);
        echo $count;
    fi
done < $file
That gives me the following output:
>NAME1
3029
That output shows only the first '>' line and a strange, clearly wrong distance between my two patterns, which suggests that I am doing at least two things wrong: the loop itself and the sed command.
I'm sorry if this post is confusing; I will be here to clarify things if necessary. I would appreciate any help, tips, or useful links.
Thank you all.
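One plausible explanation for both symptoms: sed is called without a file name, so it reads from the same standard input as the `while read` loop and consumes the rest of the file, leaving nothing for the next `read`. In addition, `/GATA/,/TCAT/` selects a range of whole lines, and `wc -c` then counts every byte of those lines, so the number is not a distance within one line. The sketch below keeps the same loop shape but works on `$line` directly with parameter expansion. The function name `print_first_gap` is hypothetical, and it only reports the gap between the first 'GATA' and the first 'TCAT' after it, not every combination.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each header line, then the number of
# characters between the first GATA and the first TCAT that follows it.
print_first_gap() {
    local file=$1 line rest gap
    while IFS= read -r line; do
        if [[ $line == ">"* ]]; then
            printf '%s\n' "$line"
        elif [[ $line == *GATA*TCAT* ]]; then
            rest=${line#*GATA}    # everything after the first GATA
            gap=${rest%%TCAT*}    # everything before the first TCAT in that remainder
            printf '%s\n' "${#gap}"
        fi
    done < "$file"
}

# Usage (assuming the multifasta path is stored in $file):
#   print_first_gap "$file"
```

Because nothing inside the loop reads from standard input, `read` sees every line of the file, and the length of `$gap` is measured within a single sequence line.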