4

I have a text file like this small example:

>ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176
SLESSPDAPDHTSETSHSPLYADPYTPPATSHRRVTDVRGLEEFLSAMQSARGPTPSSPLPSVPVSVPASDPRSCSSGPAGPYLLSKKGALQSRAAQRHRGSAKDGGPQPPDAPQLVSSAREGSPEPWLPLTDGRSPRRSRDPGYDHLWDETLSSSHQKCPQLGGPEASGGLVQWI
>ENST00000433179.2|ENSG00000187642.5|OTTHUMG00000040757.3|-|C1orf170-201|C1orf170|696
MPTQDGQLRRPARPPGPRAWMEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000341290.2|ENSG00000187642.5|OTTHUMG00000040757.3|OTTHUMT00000097943.2|C1orf170-001|C1orf170|676
MEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247
MAADTPGKPSASPMAGAPASASRTPDKPRSAAEHRKVGSRPGVRGATGGREGRGTQPVPDPQSSKPVMEKRRRARINESLAQLKTLILDALRKESSRHSKLEKADILEMTVRHLRSLRRVQVTAALSADPAVLGKYRAGFHECLAEVNRFLAGCEGVPADVRSRLLGHLAACLRQLGPSRRPASLSPAAPAEAPAPEVYAGRPLLPSLGGPFPLLAPPLLPGLTRALPAAPRAGPQGPGGPWRPWLR

This file is split into groups. Each group has 2 parts. The 1st part is a line that starts with ">" and whose elements are separated by "|"; the line after it is the 2nd part. I am trying to build a list in Python from my file that contains the 6th element of the ID part of each group. Here is the expected output for the small example:

list = ["PLEKHN1", "C1orf170", "C1orf170", "HES4"]

I am trying to first read the file into a dictionary and then build a list like the expected output using:

from itertools import groupby
with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val

k = d.keys()
res = [el[5:] for s in k for el in s.split('|')]

But it does not return what I am looking for. Do you know how to fix it?

zx8754
john
  • "it does not return what I am looking for" can you explain in your question what you're getting? – Jean-François Fabre May 30 '18 at 13:05
  • Can we assume that the file is exactly like your sample or did you modify it in order to show it to us? – HitLuca May 30 '18 at 13:06
  • This looks much too complicated. The desired output can be obtained by something as simple as `[line.split('|')[5] for line in f if line.startswith('>')]`. Is that what you need or is there a further complication you didn't mention? – mkrieger1 May 30 '18 at 13:10
  • You should probably comment your code. For instance, what is the purpose of `if not k`? – bli May 31 '18 at 14:08

3 Answers

10

Since these are clearly protein sequences in FASTA format, I suggest you use Biopython. It will save you time and be more robust than building your own parser:

from Bio import SeqIO

lst = [record.description.split('|')[5] for record in SeqIO.parse('in_file.fasta', 'fasta')]

print(lst)
# ['PLEKHN1', 'C1orf170', 'C1orf170', 'HES4']
Chris_Rands
0

Try this: res = [s[5] for s in [el.split('|') for el in k]]

output: ['HES4', 'C1orf170', 'PLEKHN1', 'C1orf170']
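
For what it's worth, the reason the original comprehension misbehaves seems to be that `el[5:]` slices each field from its sixth character onward, instead of picking the sixth field of the header. Here is a minimal sketch of the difference, using one header line from the question (the variable names `header`, `sliced` and `gene` are just for illustration):

```python
# One header line copied from the question's sample file.
header = ">ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247"

# What the question's comprehension does for each header:
# it keeps every field and drops the first five characters of each one.
sliced = [el[5:] for el in header.split('|')]

# What was intended: split once, then take the field at index 5.
gene = header.split('|')[5]

print(len(sliced))  # one (truncated) string per field, not one per header
print(gene)
```

So indexing the result of `split('|')` with `[5]`, rather than slicing each element with `[5:]`, is the key change.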

gyx-hh
-1

You can get the tokens you want by reading every line in your file and selecting only the lines that start with '>'. Then you split each of those lines on the '|' character and take the 6th element. This code does that in one line:

with open('infile.txt') as f:
    # keep only header lines, then take the 6th '|'-separated field
    tokens = [line.split('|')[5] for line in f if line.startswith('>')]
print(tokens)
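
If you want to try this without a file on disk, here is a rough self-contained sketch of the same idea. The function name `gene_names` and its `field` parameter are made up for this example, the headers are the ones from the question, and the sequence lines are shortened to their first few residues to keep it readable. It also skips headers that have too few fields, which is an extra safeguard I am assuming you might want, not something the question asks for.

```python
import io

def gene_names(handle, field=5):
    """Collect one '|'-separated field from every line starting with '>'."""
    names = []
    for line in handle:
        if line.startswith('>'):
            parts = line.rstrip('\n').split('|')
            if len(parts) > field:  # ignore headers with too few fields
                names.append(parts[field])
    return names

# In-memory stand-in for infile.txt (sequences abbreviated for the example).
sample = io.StringIO(
    ">ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176\n"
    "SLESSPDAPDHT\n"
    ">ENST00000433179.2|ENSG00000187642.5|OTTHUMG00000040757.3|-|C1orf170-201|C1orf170|696\n"
    "MPTQDGQLRRPA\n"
    ">ENST00000341290.2|ENSG00000187642.5|OTTHUMG00000040757.3|OTTHUMT00000097943.2|C1orf170-001|C1orf170|676\n"
    "MEPRGGGSSQFS\n"
    ">ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247\n"
    "MAADTPGKPSAS\n"
)

result = gene_names(sample)
print(result)
```

Because the lines are read in order and nothing goes through a dictionary, the names come out in file order and repeated names such as C1orf170 are kept.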
HitLuca
  • please read [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). Provide more explanation on your answer. – Narendra Jadhav May 30 '18 at 16:24