I am thinking about a way to parse a FASTA file in parallel. For those of you who don't know the FASTA format, here is an example:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
So lines starting with a '>' are header lines, each containing an identifier for the sequence that follows it.
I suppose one would load the entire file into memory, but after that I am having trouble finding a way to process the data.
The problem is: threads cannot start at an arbitrary position, because they could cut a sequence in two that way.
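To make the problem concrete, here is a rough Python sketch of the only workaround I can think of so far: cut the data into roughly equal pieces, then push each cut forward to the next '>' that begins a line, so that every piece starts at a header. The function names (record_aligned_offsets, parse_chunk, parse_fasta_parallel) are made up for illustration, and the sketch assumes the whole file is already in memory as ASCII bytes and starts with a header line.

```python
from concurrent.futures import ThreadPoolExecutor


def record_aligned_offsets(data: bytes, n_chunks: int) -> list[int]:
    """Return increasing chunk start offsets, each at a '>' that begins a line."""
    offsets = [0]  # assumption: the data starts with a header line
    size = len(data)
    for i in range(1, n_chunks):
        guess = i * size // n_chunks
        # Move the naive cut forward to the next newline followed by '>'.
        pos = data.find(b"\n>", max(guess - 1, 0))
        if pos == -1:
            break  # no header after this point, so no further chunks
        start = pos + 1
        if start > offsets[-1]:  # skip cuts that land on the same header
            offsets.append(start)
    return offsets


def parse_chunk(chunk: bytes) -> list[tuple[str, str]]:
    """Parse a chunk that starts at a header into (identifier, sequence) pairs."""
    records = []
    header, parts = None, []
    for line in chunk.decode("ascii").splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:].strip(), []
        elif header is not None:
            parts.append(line.strip())
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def parse_fasta_parallel(data: bytes, n_chunks: int = 4) -> list[tuple[str, str]]:
    """Split at header boundaries, parse the chunks concurrently, keep file order."""
    starts = record_aligned_offsets(data, n_chunks)
    ends = starts[1:] + [len(data)]
    chunks = [data[s:e] for s, e in zip(starts, ends)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() yields results in the order of the input chunks
        results = list(pool.map(parse_chunk, chunks))
    return [record for chunk_records in results for record in chunk_records]
```

With the standard CPython interpreter, threads will probably not speed up this kind of pure-Python parsing because of the global interpreter lock, so the thread pool here only illustrates the splitting; a process pool or a compiled language would be needed for a real gain. The chunks can also end up uneven when single sequences are very long.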
Does anyone have experience parsing files in parallel when the lines depend on each other? Any ideas are appreciated.