-2

I have a big text file ("|" separated) like this small example:

>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|RNF14|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|HAS-0|HAS|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL

the first line is ID line starting with "<" and the 2nd line is the sequence of characters belong the above ID line. looking at the 6th column there are repeated names and 7th length is the length of the line after the ID (sequence of characters). I want to select one repeat of each ID line according to the 7th column which means the ID with the longest length. expected output for the small example would be:

>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL

so there is one repeat of each ID line (looking at the column 6) according to the length which is column 7. I tried the following code in python but it does not work. do you know how to fix it?

from __future__ import print_function
import sys

def parse_fasta(data):
    name, seq = None, []
    for line in data:
        line = line.rstrip()
        if line.startswith('>'):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

isoforms = dict()
for defline, sequence in parse_fasta(sys.stdin):
    geneid = '.'.join(defline[1:].split('.')[:-1])
    if geneid in isoforms:
        otherdefline, othersequence = isoforms[geneid]
        if len(sequence) > len(othersequence):
            isoforms[geneid] = (defline, sequence)
    else:
        isoforms[geneid] = (defline, sequence)

for defline, sequence in isoforms.values():
    print(defline, sequence, sep='\n')
Chris_Rands
  • 38,994
  • 14
  • 83
  • 119
user3631908
  • 23
  • 1
  • 8
  • You're trying to match IDs in the sixth column, why not just do: geneid = defline.split('|')[5] – Ghoti Jun 11 '18 at 16:13

1 Answers1

1

I suggest you use Biopython for this, rather than building your own parser. Note I've also added a sanity check (in your case the FASTA header line ending 474 actually has sequence of length just 388):

from Bio import SeqIO

def yield_records():
    seen = set()
    for record in SeqIO.parse('in.fa', 'fasta'):
        header_seq_len = int(record.description.split('|')[-1])
        seq_len = len(record)
        if header_seq_len != seq_len:
            print('Warning: the seq length {} != that stated in the header {}'
                   .format(seq_len, header_seq_len))
        if header_seq_len not in seen:
            yield record
            seen.add(header_seq_len)

SeqIO.write(yield_records(), 'out.fa', 'fasta')
Chris_Rands
  • 38,994
  • 14
  • 83
  • 119