
I have a 10-billion-line tab-delimited file that I want to split into 5,000 sub-files, based on the value of the first column. How can I do this efficiently in Perl or Python?

This has been asked here before but all the approaches open a file for each row read, or they put all the data in memory.

OneSolitaryNoob

3 Answers


awk to the rescue!

awk 'f!=$1{close(f)} {f=$1; print >> f}' file

It processes the input row by row and keeps only one output file open at a time.

If you split the original file into chunks, this can be done more efficiently in parallel, and the generated files can be merged back afterwards (they need to be marked if the order must be preserved).
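
Since the question asks for Perl or Python, here is a minimal Python sketch of the same row-by-row, one-open-file-at-a-time idea (the function name split_by_first_column is purely illustrative, and the output file name is assumed to be exactly the first tab-delimited field):

import sys

def split_by_first_column(path):
    # Keep only one output file open at a time, opening a new one whenever
    # the key in the first column changes.  Append mode ('a') means that a
    # key which reappears later does not overwrite rows written earlier.
    out, current = None, None
    with open(path) as src:
        for line in src:
            key = line.split('\t', 1)[0]
            if key != current:
                if out is not None:
                    out.close()
                out = open(key, 'a')
                current = key
            out.write(line)
    if out is not None:
        out.close()

if __name__ == '__main__':
    split_by_first_column(sys.argv[1])

As with the awk one-liner, this is only cheap when rows with the same key tend to be adjacent; every key change costs a close and a reopen.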

karakfa

You can keep a hash (an associative array) mapping column values to open output file handles, and open an output file only if none is open for that column value yet.

This will be good enough unless you hit the limit on the maximum number of open files. (Use ulimit -Hn to see it in bash.) If you do, you either need to close file handles (e.g. a random one, or the one that hasn't been used for the longest time, which is easy to keep track of in another hash), or you need to make multiple passes over the input, processing only as many column values as you can keep output files open for in one pass and skipping those in later passes.
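
As a rough illustration of the handle-cache idea (a sketch under assumptions, not a drop-in solution): MAX_OPEN is an assumed self-imposed cap that you would set below your open-files limit, and eviction here closes the least-recently-used handle.

import sys
from collections import OrderedDict

MAX_OPEN = 1000          # assumed cap; keep it below your open-files limit

handles = OrderedDict()  # column value -> open file handle, in LRU order
created = set()          # column values whose output file already exists

def handle_for(key):
    if key in handles:
        handles.move_to_end(key)                # mark as most recently used
        return handles[key]
    if len(handles) >= MAX_OPEN:
        _, oldest = handles.popitem(last=False) # evict least recently used
        oldest.close()
    mode = 'a' if key in created else 'w'       # truncate only on first open
    fh = handles[key] = open(key, mode)
    created.add(key)
    return fh

for line in sys.stdin:
    handle_for(line.split('\t', 1)[0]).write(line)

for fh in handles.values():
    fh.close()

The created set is what lets a file that was closed and later reopened be appended to rather than truncated; with only 5,000 distinct column values it stays small.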

reinierpost
  • I think your first suggestion is the better one for the general case (more distinct files than the maximum you can have open at the same time); the second, with multiple passes, will probably take more time (I think). About the first approach: instead of closing a random handle or the one that hasn't been used for the longest time, I suggest fixing a limit on the number of open files and storing all other records in a buffer of fixed size. Once that size is reached, you close all the files in your hash and open as many files as you can for the keys in the record buffer (see the sketch after these comments). – Casimir et Hippolyte Apr 14 '16 at 21:02
  • Interesting, but why will that be better? – reinierpost Apr 14 '16 at 22:38
  • Because you can use this buffer to find the best candidates for the next set of file handles. – Casimir et Hippolyte Apr 14 '16 at 22:52
  • This may still be suboptimal; it really depends on the input data. – reinierpost Apr 20 '16 at 10:18
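
A rough Python sketch of the buffering idea from the comments above; MAX_OPEN and MAX_BUFFERED are assumed tuning knobs, and output files are opened in append mode so that nothing written earlier is lost:

import sys
from collections import defaultdict

MAX_OPEN = 1000         # assumed cap on simultaneously open output files
MAX_BUFFERED = 100_000  # assumed cap on lines held in memory

open_files = {}               # keys currently being written directly
buffered = defaultdict(list)  # pending lines for keys with no open handle

def rotate():
    # Close the current set of handles, then open handles for the keys with
    # the most buffered lines and flush their pending data.
    for fh in open_files.values():
        fh.close()
    open_files.clear()
    busiest = sorted(buffered, key=lambda k: len(buffered[k]), reverse=True)
    for key in busiest[:MAX_OPEN]:
        fh = open(key, 'a')
        fh.writelines(buffered.pop(key))
        open_files[key] = fh

pending = 0
for line in sys.stdin:
    key = line.split('\t', 1)[0]
    if key in open_files:
        open_files[key].write(line)
    else:
        buffered[key].append(line)
        pending += 1
        if pending >= MAX_BUFFERED:
            rotate()
            pending = sum(len(lines) for lines in buffered.values())

while buffered:      # drain whatever is still buffered at end of input
    rotate()
for fh in open_files.values():
    fh.close()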

This program will do as you ask. It expects the input file as a parameter on the command line, and it writes output files whose names are taken from the first column of the input file's records.

It keeps a hash %fh of file handles and a parallel hash %opened of flags that indicate whether a given file has ever been opened before. A file is opened for append if it appears in the %opened hash, or for write if it has never been opened before. If the limit on open files is hit then a (random) selection of 1,000 file handles is closed. There is no point in keeping track of when each handle was last used and closing the stalest handles: if the data in the input file is randomly ordered then every handle in the hash has the same chance of being the next to be used, while if the data is already sorted then none of the closed file handles will ever be needed again.

use strict;
use warnings 'all';

my %fh;
my %opened;

while ( <> ) {

    # The first whitespace-separated field of each record is both the key
    # and the name of the output file
    my ($tag) = split;

    if ( not exists $fh{$tag} ) {

        # Append if this file has already been opened (and closed) before,
        # otherwise create it
        my $mode = $opened{$tag} ? '>>' : '>';

        # Retry the open until it succeeds
        while () {

            eval {
                open $fh{$tag}, $mode, $tag or die qq{Unable to open "$tag" for output: $!};
            };

            if ( not $@ ) {
                $opened{$tag} = 1;
                last;
            }

            die $@ unless $@ =~ /Too many open files/;

            # Out of file descriptors: close up to 1,000 of the open handles
            # and try again
            my $n;
            for my $tag ( keys %fh ) {
                my $fh = delete $fh{$tag};
                close $fh or die $!;
                last if ++$n >= 1_000 or keys %fh == 0;
            }
        }
    }

    print { $fh{$tag} } $_;
}


close $_ or die $! for values %fh;
Borodin