1

Given that I have a file of N size. For the sake of example 30GB file.

Facts about the file content is that it has proprotional amount of lines. This is interleaved FastQ file. (not important for the question but usefull for someone)

File content is paired or interleaved DNA sequence of strings. Each pair is 8 lines long.

I want to process the interleaved FastQ with GNU parallel in order to speed up the process. Reason for using parallel instead of native bwa tool threads feature is that parallel helps to reduce amount of RAM needed because the nature of bwa memory allocation.

Given that interleaved file is 30GB of size I want to process chunks of --block 500M, command line params looks like --pipe --block 500M -L 8 -j 10 this then is sent as stdin to bwa and will run 10 bwa tasks each getting 500M chunks with a record of 8 lines.

Is my assumption correct that --block 500M and -L 8 will be managed by parallel and I can be certain that my bwa tool will always get 8 lines times N MB of data?

What I am not clear is, will parallel "repeat" last "chunk" if 8 lines are not present? And will it apropriatelly controll other chunk inputs for N processes I start with parallel?

Or this --block 500M "blindly" sends 500M chunk to single process regardless if last part of the 500M chunk does not contain 8 lines so to speak?

Update:

After whole day reading questions and answers on biostars and seqanswers I've realised that my testing/"benchmarking" was wrong.

But this helped to realise that I need to update the question and will make separate question.

I was testing inside Docker container which by default has very low /dev/shm thus I have mislead my self to go totaly different path.


titus
  • 377
  • 4
  • 16
  • parallel is a cook tool but _if_ your need to manage the fastsq, the mapping, the downstream BAM files, VCF files, you should use a workflow manager (nextflow, snakemake). – Pierre Aug 06 '21 at 17:39
  • 1
    @Pierre Yes it is clear that processing in Genomics requires "pipeline framework". But this question was not about Genomics but about `parallel` and the way it works. – titus Aug 07 '21 at 01:02

2 Answers2

1

Yes, you can be certain.

The --block parameter is described here: https://www.gnu.org/software/parallel/parallel_tutorial.html#chunk-size

The -L parameter here: https://www.gnu.org/software/parallel/parallel_tutorial.html#records

Quick summary: Parallel will always send full lines to each process until the block/buffer capacity is filled. If you specify a that one record requires several lines (8 in your case), it will fill the buffer capacity in chunks of 8 lines each.

The last block can be smaller than 8 lines, if there are fewer remaining.

Side note: In the case of properly formatted and interleaved fastq files, there will always be 8 lines. fastq format specifies that each record is 4 lines and paired-end fastq files must contain the same number of records.

Jonas
  • 26
  • 2
0

@Jonas is right.

But be aware that -L is rather slow: GNU Parallel has to examine every single byte of input.

If you can use --pipepart GNU Parallel is way faster.

Unfortunately you have to do without -L and instead use --recstart/--recend (possibly with --regexp) to determine where a record starts.

With FASTA this is easy: https://www.gnu.org/software/parallel/man.html#example-call-program-with-fasta-sequence

With FASTQ and especially interleaved FASTQ (i-FASTQ) this is harder. I have not seen enough i-FASTQ files to find a general solution, but the idea is:

  • Make a regexp that always matches the end of an i-FASTQ record. Use that for --recend.
  • Make a regexp that always matches the start of an i-FASTQ record. Use that for --recstart.
  • Make sure recend+recstart only matches at the border of a record.

Then use:

parallel --pipepart -a big.file --regexp --recstart 'starting regexp' --recend 'ending regexp' ...

If someone builds such a regexp couple I think it would make it into the example section of GNU Parallel, because it will be generally useful for i-FASTQ processing.

The regexp couple for normal FASTQ can be found here: https://stackoverflow.com/a/41707920/363028 but this does not guarantee that R1 and R2 will be kept together. We need to find some feature of seqname that distinguishes between R1 and R2 for all i-FASTQ files.

https://en.wikipedia.org/wiki/FASTQ_format mentions these three variants:

@HWUSI-EAS100R:6:73:941:1973#0/1
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1

where '/1' and ' 1:' indicate this is R1.

This should match those:

--regexp --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]'
Ole Tange
  • 31,768
  • 5
  • 86
  • 104