2

I have a BLAST outfmt 6 output file in standard format, I want to find a way to loop through the file, select each hit, find its reciprocal hit and decipher which is the best hit to store.

For example:

d = {}
for line in input_file:
    term = line.split('\t')
    qseqid = term[0]
    sseqid = term[1]
    hit = qseqid, sseqid
    recip_hit = sseqid, qseqid
    for line in input_file:
        if recip_hit in line:
            compare both lines
done

Example input (tab delimited):

Seq1    Seq2    80    1000   10    3   1    1000    100    1100    0.0    500
Seq2    Seq1    95    1000   10    3   100    1100    1    1000    1e-100    500

Can anyone provide any insight how to efficiently tackle this problem?

Many thanks in advance

Gloom
  • 317
  • 1
  • 3
  • 13
  • My apologies, will add input now, no, just looking for some direction – Gloom Feb 06 '18 at 13:51
  • I assume in this context that each pair, "Seq1, Seq2" and it's reverse counterpart, can not occur more than once in your input file, right? – Mr. T Feb 06 '18 at 14:21
  • Not in this dataset anyway – Gloom Feb 06 '18 at 14:22
  • apologies - in this case it is a list of lists, however an open file would be a much better approach in my opinion – Gloom Feb 06 '18 at 14:37
  • You should update your question with the additional information. People tend not to read comments. And imho reading directly from the file is not a good idea. Until the very last line, you don't know, which lines you can discard, because there is no counterpart. So you have to read the file twice - once to make the index of matching pairs and once to read the lines for those matching pairs. – Mr. T Feb 06 '18 at 14:50

1 Answers1

2

You could approach your problem to find those pairs and compare the lines like this:

#create a dictionary to store pairs
line_dict = {}
#iterate over your file
for line in open("test.txt", "r"):
    line = line[:-1].split("\t")
    #ignore line, if not at least one value apart from the two sequence IDs
    if len(line) < 3:
        continue
    #identify the two sequences
    seq = tuple(line[0:2])
    #is reverse sequence already in dictionary?
    if seq[::-1] in line_dict:
        #append new line
        line_dict[seq[::-1]].append(line)
    else:
        #create new entry
        line_dict[seq] = [line]

#remove entries, for which no counterpart exists
pairs = {k: v for k, v in line_dict.items() if len(v) > 1}

#and do things with these pairs
for pair, seq in pairs.items():
    print(pair, "found in:")
    for item in seq:
        print(item)

The advantage is that you only have to iterate once over your file, because you store all data and discard them only, if you haven't found a matching reversed pair. The disadvantage is that this takes space, so for very large files, this approach might not be feasible.

A similar approach - to store all data in your working memory - utilises pandas. This should be faster, since sorting algorithms are optimised for pandas. Another advantage of pandas is that all your other values are already in pandas columns - so further analysis is made easier. I definitely prefer the pandas version, but I don't know, if it is installed on your system. To make things easier to communicate, I assigned a and b to the columns that contain the sequences Seq1 and Seq2.

import pandas as pd
#read data into a dataframe
#not necessary: drop the header of the file, use custom columns names
df = pd.read_csv("test.txt", sep='\t', names=list("abcde"), header = 0)

#create a column that joins Seq1 - Seq2 or Seq2 - Seq1 to Seq1Seq2
df["pairs"] = df.apply(lambda row: ''.join(sorted([row["a"], row["b"]])), axis = 1)
#remove rows with no matching pair and sort the database
only_pairs = df[df["pairs"].duplicated(keep = False)].sort_values(by = "pairs")

print(only_pairs)
Mr. T
  • 11,960
  • 10
  • 32
  • 54