Given that I have a file of size N; for the sake of example, a 30 GB file. The file content is made up of uniform records: it is an interleaved FastQ file (not important for the question, but useful for someone). The content is paired, or interleaved, DNA sequence strings, and each pair is 8 lines long.
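
For context, one such record is two standard 4-line FastQ entries back to back; a made-up example (read name and sequences are placeholders):

```
@pair1/1
ACGTACGTACGT
+
IIIIIIIIIIII
@pair1/2
TGCATGCATGCA
+
IIIIIIIIIIII
```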
I want to process the interleaved FastQ with GNU `parallel` in order to speed up the processing. The reason for using `parallel` instead of the native `bwa` threads feature is that `parallel` helps reduce the amount of RAM needed, because of the nature of `bwa`'s memory allocation.
Given that the interleaved file is 30 GB in size, I want to process chunks of `--block 500M`, so the command-line params look like `--pipe --block 500M -L 8 -j 10`. Each chunk is then sent as stdin to `bwa`, running 10 `bwa` tasks, each getting roughly 500M chunks made of 8-line records.
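
A minimal sketch of the pipeline I have in mind, assuming `bwa mem` with `-p` for interleaved input; `ref.fa`, `interleaved.fq`, and `out/` are placeholders:

```bash
# Split stdin into ~500M chunks of whole 8-line records and
# feed each chunk to its own bwa process; {#} is the job number.
cat interleaved.fq |
  parallel --pipe --block 500M -L 8 -j 10 \
    'bwa mem -p ref.fa - > out/chunk_{#}.sam'
```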
Is my assumption correct that `--block 500M` and `-L 8` will be managed by `parallel`, so that I can be certain my `bwa` tool will always get a whole multiple of 8 lines per ~500 MB chunk of data?
What I am not clear on is: will `parallel` hold back ("repeat") the last partial record if a full 8 lines are not present in the current chunk? And will it appropriately control the other chunk inputs for the N processes I start with `parallel`? Or does `--block 500M` "blindly" send a 500M chunk to a single process, regardless of whether the last part of the 500M chunk contains a complete 8-line record, so to speak?
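
In case it helps others, this is the check I would run on a small sample to test the assumption (block size shrunk so several chunks are produced). If `parallel` respects record boundaries, every printed count should be a multiple of 8:

```bash
# Each job prints the number of lines in the chunk it received;
# a count that is not a multiple of 8 would mean a pair was split.
head -n 80000 interleaved.fq |
  parallel --pipe --block 1M -L 8 -j 4 'wc -l'
```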
Update:
After a whole day of reading questions and answers on Biostars and SEQanswers, I've realised that my testing/"benchmarking" was wrong. But this helped me realise that I need to update the question, and I will make a separate question. I was testing inside a Docker container, which by default has a very small `/dev/shm`, and thus I misled myself down a totally different path.
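
For anyone hitting the same wall: Docker's default `/dev/shm` is only 64 MB, and it can be raised per container with `--shm-size` (image name below is a placeholder):

```bash
# Raise the container's /dev/shm from the 64 MB default to 8 GB.
docker run --shm-size=8g my-bwa-image
```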