Parsing file in parallel

Question

I am thinking about a way to parse a FASTA file in parallel. For those of you who don't know the FASTA format, here is an example:

>SEQUENCE_1  
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG  
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK  
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL  
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL  
>SEQUENCE_2  
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI  
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

So lines starting with a '>' are header lines containing an identifier for the sequence that follows.

I suppose you load the entire file into memory, but after this I am having trouble finding a way to process the data.

The problem is: threads cannot start at an arbitrary position, because they could cut sequences in two that way.

Does anyone have experience in parsing files in parallel when the lines depend on each other? Any idea is appreciated.

You could also ask at http://biostar.stackexchange.com/ – Pierre Nov 24 '11 at 16:39

score 2 · Accepted Answer · answered Nov 27 '11 at 20:23

Should be easy enough, since the dependence of lines on each other is very simple in this case: just make the threads start at arbitrary positions and then skip lines until they reach one that starts with a '>' (i.e. one that starts a new sequence).

To make sure no sequence gets processed twice, keep a set of all sequence IDs that have been processed (or you could do it by line number if the sequence IDs aren't unique, but they really should be!).
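A rough sketch of this idea in Python, not a definitive implementation. It makes one change to what is described above: instead of a shared set of processed IDs, each worker only takes the records whose header line begins inside its own byte range, so no record can be picked up twice. The names `parse_chunk` and `parse_fasta_parallel` are made up for the illustration, and the file is assumed to be plain, uncompressed text.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(path, start, end):
    """Return (header, sequence) pairs for records whose '>' line begins in [start, end)."""
    records = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard through the next newline, so we
            # are at a line start even if `start` fell in the middle of a line.
            f.seek(start - 1)
            f.readline()
        header, parts = None, []
        while True:
            pos = f.tell()
            if header is None and pos >= end:
                break  # nothing left in this range that belongs to us
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                if header is not None:
                    records.append((header, b"".join(parts).decode()))
                    header = None
                if pos >= end:
                    break  # this record belongs to the next range
                header, parts = line[1:].strip().decode(), []
            elif header is not None:
                parts.append(line.strip())
            # otherwise: still skipping the tail of the previous range's record
        if header is not None:
            records.append((header, b"".join(parts).decode()))
    return records


def parse_fasta_parallel(path, workers=4):
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division
    starts = list(range(0, size, step))
    ends = [min(s + step, size) for s in starts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(parse_chunk, [path] * len(starts), starts, ends))
    return [record for chunk in chunks for record in chunk]
```

Note that in CPython, threads will mostly help if the per-record work releases the GIL or is I/O-bound; for CPU-heavy parsing, swapping in `ProcessPoolExecutor` is the usual alternative.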

score 1 · Answer 2 · answered Nov 24 '11 at 15:21

Do a preprocessing step, walk through the data once, and determine all valid start points. Let's call these tasks. Then you can simply use a worker-crew model, where each worker repeatedly asks for a task (a starting point), and parses it.
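As a minimal sketch of what this could look like in Python (all function names here are hypothetical, and the file is assumed to be plain, uncompressed text): one sequential pass records the byte offset of every '>' line, and a crew of worker threads then pulls offsets from a queue and parses one record each.

```python
import queue
import threading


def find_record_offsets(path):
    """Preprocessing pass: byte offset of every line that starts with '>'."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            if line.startswith(b">"):
                offsets.append(pos)
            pos += len(line)
    return offsets


def parse_record(f, offset):
    """Parse the single record whose header line begins at `offset`."""
    f.seek(offset)
    header = f.readline()[1:].strip().decode()
    parts = []
    for line in iter(f.readline, b""):
        if line.startswith(b">"):
            break  # next record starts here
        parts.append(line.strip())
    return header, b"".join(parts).decode()


def worker(path, tasks, results):
    # Each worker keeps its own file handle and repeatedly asks for a task.
    with open(path, "rb") as f:
        while True:
            try:
                index, offset = tasks.get_nowait()
            except queue.Empty:
                return
            results[index] = parse_record(f, offset)


def parse_with_worker_crew(path, n_workers=4):
    offsets = find_record_offsets(path)
    tasks = queue.Queue()
    for task in enumerate(offsets):
        tasks.put(task)
    results = [None] * len(offsets)  # each slot is written by exactly one worker
    crew = [threading.Thread(target=worker, args=(path, tasks, results))
            for _ in range(n_workers)]
    for t in crew:
        t.start()
    for t in crew:
        t.join()
    return results
```

The offset scan already reads the whole file once, so this only pays off when the per-record work done in `parse_record` is much heavier than the scan itself.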

answered by Yuri

To do the preprocessing, you've basically read the files, which probably dominates the processing time. – Ira Baxter Nov 24 '11 at 15:24

@Ira Baxter Possibly. I do not really know to what extent the file is being "parsed". If a sequence is parsed into a simple array, list, or some similar data structure, you would indeed not gain anything from this. However, in that case I doubt that you would gain anything by parallelising it anyway. – Yuri Nov 24 '11 at 15:25
