1

I want to pipe a stream to split. I know how big the stream will be in bytes (very big; it comes from the network), and I want split to create N files of roughly equal size without splitting lines in half. Is it possible to achieve that? Something like:

cat STREAM | split $SIZE_OF_STREAM $NUMBER_OF_FILES_TO_PRODUCE

I could not find a way to do this in the docs. I'm sorry if it is obvious, but I couldn't find it.

user9517
ddinchev

3 Answers

1

Oh well, it seems that the split utility on Mac (and maybe BSD) is one option short :(

On Linux, split has a -C option, which sets the maximum number of bytes of whole lines to put in each output file. Put more simply: if you run cat file | split -C 1000, it will create chunks of UP TO 1000 bytes of whole lines, which with elementary math gives me an easy way to achieve what I wanted.
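The elementary math mentioned above can be sketched roughly as follows. This assumes GNU split (for `-C`); `chunk_size` and `split_evenly` are hypothetical helper names, and the output prefix is whatever you choose. Because `-C` only caps each file at that many bytes of whole lines, you may end up with slightly more than N files when lines do not pack evenly.

```shell
# Hypothetical helper: ceiling division of stream size by file count.
# $1 = total stream size in bytes, $2 = desired number of files
chunk_size() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical wrapper: read the stream on stdin and split it with GNU split -C.
# $1 = stream size in bytes, $2 = desired number of files, $3 = output prefix
split_evenly() {
    split -C "$(chunk_size "$1" "$2")" - "$3"
}
```

Usage would look something like `cat STREAM | split_evenly "$SIZE_OF_STREAM" "$NUMBER_OF_FILES_TO_PRODUCE" part_`.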

ddinchev
  • Just comparing the Linux and Mac versions that I have on my desk, Linux has `-C`, `-d`, `-e`, `--filter`, `-n`, `-u` and `--verbose` that the Mac version does not. The Mac version has `-p` which splits files based on a regular expression that Linux does not. – Ladadadada Dec 05 '12 at 08:49
0

Create a file that will be our STREAM:

printf '1234\n5678\n' > xfile

Now split it:

for i in $(seq 0 $(( ($(wc -c < xfile) + 1) / 2 - 1 ))); do dd if=xfile of=file$i bs=1 count=2 skip=$((i * 2)); done

This will give you a lot of files with a fixed size of 2 bytes each, named file0, file1, file2, ...

alterpub
  • you probably meant `echo -e ...` and I get `dd: invalid number \`{0..11}'`. – user9517 Dec 05 '12 at 08:41
  • Oh and the OP doesn't want a fixed size of bytes - that's easy, `split -b` can do that if you know the size of the file beforehand. Similarly if you know the number of lines you can use `split -l` ... – user9517 Dec 05 '12 at 08:44
  • I have fixed the example, it works fine now! – alterpub Dec 05 '12 at 08:47
0

I would simply split on line count as that will make all files except for the last one nearly equal.

export LINE_COUNT=100000
cat "$STREAM" | split -l "$LINE_COUNT"

You could do the math with $SIZE_OF_STREAM divided by $NUMBER_OF_FILES_TO_PRODUCE but just setting a line count gets you 90% of the way there for having all files basically equal unless your line length is distributed in a very non-normal manner.
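If you do want to do that math, one rough way is sketched below. It assumes you can guess an average line length, which is not something the stream size alone tells you; `lines_per_file` is a hypothetical helper name, and the result is only an estimate to pass to `split -l`.

```shell
# Hypothetical helper: estimate how many lines each file should hold.
# $1 = stream size in bytes
# $2 = desired number of files
# $3 = ASSUMED average line length in bytes, including the newline
lines_per_file() {
    # Bytes per file, rounded up, then converted to lines, rounded up.
    echo $(( ( ($1 + $2 - 1) / $2 + $3 - 1 ) / $3 ))
}
```

For example, `cat "$STREAM" | split -l "$(lines_per_file "$SIZE_OF_STREAM" "$NUMBER_OF_FILES_TO_PRODUCE" 80)"` would assume lines of about 80 bytes.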

I have linked to the online documentation, but man pages are shipped with OS X so you can see that split there has a byte cutoff as well as a line cutoff.

bmike