
I know how to do this:

commandGeneratingLotsOfSTDOUT | bzip2 -z -c > compressed.bz2

I also know how to do this:

commandGeneratingLotsOfSTDOUT | split -l 1000000

But I don't know how to do this:

commandGeneratingLotsOfSTDOUT | split -l 1000000 -compressCommand "bzip2 -z -c"

In case the above isn't already 100% clear: I am running a command that generates a terabyte or two of output. I want that output split into chunks of N lines (one million in this case), with each chunk bzip2-compressed and stored in its own file.

Right now what I do is this:

commandGeneratingLotsOfSTDOUT | split -l 1000000
for f in x??; do bzip2 -z "$f"; done

This adds an extra write to disk and read back from disk (and another write to disk, albeit compressed) for every single file! Since the files are all bigger than RAM, this translates into actual disk traffic rather than page-cache hits.
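To make the two-step method above concrete, here is a minimal runnable sketch on toy data (the temp directory, `seq 10` input, and 2-line chunk size are stand-ins for the real TB-scale case):

```shell
# Current approach: split to plain files first, then re-read and compress
# each chunk -- the extra uncompressed round-trip is the problem.
tmpdir=$(mktemp -d) && cd "$tmpdir"
seq 10 > input.txt                      # stand-in for the huge output
split -l 2 input.txt                    # writes xaa, xab, ... uncompressed
for f in x??; do bzip2 -z "$f"; done    # second pass: compress each chunk
ls x??.bz2
```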

philo vivero

2 Answers


How about:

cmdWithLotsOfSTDOUT | split -l 1000000 --filter 'bzip2 > "$FILE.bz2"'

An example:

$ ls
afile

$ cat afile
one
two
three
four
five
six
seven
eight
nine
ten

$ cat afile | split -l 2 --filter='bzip2 > "$FILE.bz2"'

$ ls
afile  xaa.bz2  xab.bz2  xac.bz2  xad.bz2  xae.bz2

$ bzip2 -dc xac.bz2
five
six

$
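Applied to the original use case, the same idea works with numbered suffixes and an explicit output prefix. A small sketch (the `-d` flag, the `chunk.` prefix, and the toy `seq 10` input are my additions, assuming GNU coreutils split):

```shell
# Split stdin into 2-line chunks, compressing each on the fly;
# -d gives numeric suffixes: chunk.00.bz2, chunk.01.bz2, ...
tmpdir=$(mktemp -d) && cd "$tmpdir"
seq 10 | split -l 2 -d --filter='bzip2 > "$FILE.bz2"' - chunk.
bzip2 -dc chunk.02.bz2   # third chunk: lines 5 and 6
```

Note the single quotes around the filter: `$FILE` must be expanded by split's child shell, not by your interactive shell.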
ooga
  • You have the right idea. The syntax is wrong, though. It's --filter='bzip2 -c > $FILE.bz2' - only the = is missing, what you're doing with the double-quotes is useful if $FILE has spaces in it, but in my case isn't precisely necessary. – philo vivero May 16 '14 at 17:38
  • It works with or without the `=`, and for compression the `-c` switch is unnecessary. It's really only necessary for decompression, which otherwise creates a file. – ooga May 16 '14 at 17:47

I'm going to answer this question, but hopefully I will not have to mark it as the correct answer.

The GNU coreutils are open source. There's a repo here, for example: https://github.com/goj/coreutils. It contains the source code for the split command as split.c: https://github.com/goj/coreutils/blob/rm-d/src/split.c. One could modify it to:

  1. Add capability to take as an input argument a program through which the split chunks will be piped,
  2. Have split pass to the program what file it should write to.

This is suboptimal, as one would have to be proficient in C, GNU coding conventions, and so on. I have the technical know-how to do this, but I would hesitate to do the work unless I knew the patch would be accepted back into mainline. Coordinating with the fine folks in #gnu might be required.

Another way to do this is to write your own splitCompress program/script. I actually did so, in Perl, and it was about 10x less performant than the method laid out in the question. There may be ways of optimizing Perl for streaming large amounts of data. I put a copy of the Perl program here: http://faemalia.com/Technology/splitCompress.pl.html. It's possible that, with some tweaking, this program could become the basis for a great Right Answer.

EDIT: I just looked at the logs, and actually the Perl "splitCompress.pl" program is roughly equivalent in speed to the method outlined in the question. It is not 10x less performant.
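For what it's worth, on systems whose split lacks --filter, the same one-pass split-and-compress behaviour can be sketched portably with awk (this is not the Perl script above; the `chunkNNN.bz2` naming, the 2-line chunk size, and the `seq 10` input are illustrative):

```shell
# One-pass split+compress: awk routes each n-line group into its own
# bzip2 pipe, so nothing uncompressed ever touches the disk.
tmpdir=$(mktemp -d) && cd "$tmpdir"
seq 10 | awk -v n=2 '
  NR % n == 1 {
    if (cmd != "") close(cmd)                        # flush previous chunk
    cmd = sprintf("bzip2 > chunk%03d.bz2", (NR-1)/n)
  }
  { print | cmd }
  END { if (cmd != "") close(cmd) }'
bzip2 -dc chunk002.bz2   # third chunk: lines 5 and 6
```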

philo vivero
  • Have you looked into the `--filter` option to `split`? I was trying to get it to work but can't quite do it. – ooga May 16 '14 at 17:11
  • Ooga: can you offer this as an answer? I'm feeling pretty dense. The --filter option to split looks like the right thing. I'd like to mark your answer correct if we can get it to work. – philo vivero May 16 '14 at 17:22