GNU Parallel: split file into children

Question

Goal

Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.

Input File

name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)

Hardware

36 GB Memory
16 CPUs
HPCC environment (I'm not admin)

Code

Version 1

zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"

Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.

Version 2

# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.

zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"

The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:

parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.

Questions

How can my code be improved ?
Is there a faster way to accomplish this goal ?

Not sure I understand your input files. If your files must have 104 million lines each because you have 16 CPUs, I deduce your input files have 1.6 billion lines. Then you say the records are 3GB each, so you have 1.6 billion records of 3GB each compressed into a 39GB file. I am kind of thinking I want that compression algorithm :-) Please advise which bit I have misunderstood! — Mark Setchell, Jan 17 '17 at 18:44
@MarkSetchell : file1.fastq.gz (39 GB) contains 1,667,430,708 lines. Its children should contain, at most, 104,214,420 lines each. Honestly, I don't know the size of the largest line/record. I arrived at --block-size 3000000000 by noticing Parallel had increased the size from 1 MB (default) to over 2 GB. I figured 3 GB would be safe. Please enlighten :) — fire_water, Jan 17 '17 at 19:02
Sorry, I cannot enlighten - I am unenlightened :-( Just trying to understand. I think we may have to wait for Ole to enlighten us all :-) — Mark Setchell, Jan 17 '17 at 19:18
[This answer on Super User](http://superuser.com/a/906756/2085) might be helpful. — Sean Bright, Jan 17 '17 at 20:20
@SeanBright : the file is ~150 GB uncompressed ( gzip -dc file1.fastq.gz | wc -c ). — fire_water, Jan 17 '17 at 20:21

Ole Tange · Answer 1 · 2023-02-06T09:48:11.997

Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.

You tell that to GNU Parallel with -L 4.

In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.

To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.

zcat file1.fastq.gz |
  parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"

This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:

zcat file1.fastq.gz |
  parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"

parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.

Now if you had the fastq-file uncompressed we can do it even faster, but we will have to cheat a little: Often in fastq files you will have a seqname that starts the same string:

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.

We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to give the block size to 1/16 of the size of file1-fastq:

parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"

If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.

parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"

Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.

It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:

parallel -a file1.fastq --pipe-part --block -1 -j16 
--regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]'
my_command

Here we assume that the lines will start like this:

@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>

Even if you have a few quality lines starting with '@', then they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.

You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:

You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte had been read (and the first gzip will probably be done by then).

By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.

You are a scholar and a wise man. Thank you! Question: let's say I am processing the FASTQ files in pairs. For example, in paired-end sequencing (a bioinformatics term), we have two FASTQ files: file1.r1.fastq.gz & file1.r2.fastq.gz. Here, the order matters: record 1 in the r1 file is paired with record 1 in the r2 file, and so on, later in the analysis. What would the Parallel command look like to accommodate for this scenario? — fire_water, Jan 18 '17 at 15:41

score 1 · Answer 2 · answered Jan 18 '17 at 23:01

Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.

split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:

doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.

Filenames will be named big.aa.gz .. big.ap.gz.

Thanks, again! I'm just now reading this. Yesterday I was playing around and came up with some code that I wanted to show you: parallel zcat {} '|' split -l ${child_num_lines} --filter=\''gzip > $FILE.gz'\' - ${temp_dir}/{}_ ::: "${r1_fastq_gz}" "${r2_fastq_gz}" Does that accomplish the same thing as your code but in a different, and likely slower, way while still maintaining predictable order ? — fire_water, Jan 20 '17 at 17:22

GNU Parallel: split file into children

2 Answers2

Linked