Get the number of characters between two patterns that are in the next line of a line that begins with '>' in a large file

Question

First of all, sorry about the extensive size of the title, i could't find a better what to explain where i want to get to with this bash script.

I have a very large file (multifasta) that looks like this:

>NAME1
GATATATAGATTAGATTTAGAGAGAGGAGCTATTCATCAGAGCTATCATCAGCTACAGCA
>NAME2
GCGCTAGAGAGCTAGCTACGACTAGCACTAGAGGATACATCATGGGTCATCAGCAGTCAGCATCAC
>NAME3
GCATCAGCATGATAGATCTCATGACTAGATAGAACTATCAT

and goes on....

I also have two patterns:

'GATA' and 'TCAT'

I already know that those 2 patterns exist in every line that doesn't begin with '>', sometimes more than once. So, my objective is to print the '>' line and then get the distance between all the combination of the two patterns in the next line to it, like this:

>NAME1
29 #distance between the only 'GATA' and the first 'TCAT'
41 #distance between the only 'GATA' and the second 'TCAT'
>NAME2
2 #distance between the only 'GATA' and the first 'TCAT'
9 #distance between the only 'GATA' and the second 'TCAT'
>NAME3
4 #distance between the first 'GATA' and first 'TCAT'
23 #distance between the first 'GATA' and second 'TCAT'
6 #distance between the second 'GATA' and the second 'TCAT'

In the third block, there is no distance between the second 'GATA' and the first 'TCAT' because the second pattern appears before the first pattern.

I tried the following code:

while IFS= read -r line;
do
        echo $line;
        if [[ "$line" == ">"* ]];
        then
                echo $line;
        else
                count=$(sed -n /GATA/,/TCAT/p' | wc -c);
                echo $count;
        fi
done < $file

That gives me the following output:

>NAME1
3029

That output gives me just the first '>' line and a really weird and wrong distance between my two patterns, that suggest that i might be doing at least two things wrong, the loop itself and the sed command.

I'm sorry if this was a confusing post and i will be here to clarify things if necessary. I will appreciate any help i can get, or tips or useful links.

Thank you all,

I would suggest you, approaching to problem with creating smaller parts of the actual problem. You'll be going from there testing each part and making sure that as you approach the final line, you'll be confident about the parts of your script. Smaller goals would be: (1) Create a script that would tell the difference between two points of a string. (2) Now that you're confident about it, now it would be a time fora script to find the points. (3) would be to combine both to see whether they work. (4) Now it's time for a script to find many points in a line. (5) A script to check all lines..etc — Emirhan Özlen, Jan 12 '18 at 20:00
`sed` does not have an input, thus using stdin, and taking it away from the loop. `/GATA/,/TCAT/` is an address for the command `p`, it is a inclusive range of lines . It is not working as a range within the line. `sed` is not good in non-greedy matches, maybe use something like `perl -pe 's/.*?(GATA.*?TCAT).*?/ \1 /g'` and then check the size of the words found. That would only get the smallest distances. — ULick, Jan 12 '18 at 22:14

score 1 · Answer 1 · answered Jan 12 '18 at 22:24

1

Following awk may help you in same. Also not sure how come 6 has come on output in NAME3 since string TCAT is present only 2 times in line after NAME3.

awk -v pattern1="GATA"  -v pattern2="TCAT"  '
/^>/{
  print;
  index2=prev="";
  next
}
{
  val=$0;
  while(index(val,pattern2)){
    index2=prev?prev+length(pattern2)+index(val,pattern2):index(val,pattern2);
    print index2-(index($0,pattern1)+length(pattern1));
    val=substr(val,index(val,pattern2)+length(pattern2)+1);
    prev=index2
}}
'   Input_file

Will add explanation too shortly.

answered Jan 12 '18 at 22:24

RavinderSingh13

130,504
14
57
93

1

Thank you very much for your reply and help! Your script is very good and works pretty nice. Is there a way to improve it to calculate also the distances between a second appearance of the pattern 'GATA' in NAME3 (first GATA at 10th position and second one at 27th position in line after NAME3) and the pattern 'TCAT'? Thank you very much Ravinder and i look forward to your explanation :) – Fernanda Costa Jan 12 '18 at 23:30

Get the number of characters between two patterns that are in the next line of a line that begins with '>' in a large file

1 Answers1