14

Is there a one-liner to split a text file into pieces / chunks after every Nth occurrence of a delimiter?

example: the delimiter below is "+"

entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
...

There are several million entries, so splitting on every occurrence of delimiter "+" is a bad idea. I want to split on, say, every 50,000th instance of delimiter "+".

Unix commands "split" and "csplit" just don't seem to do this...

cmo

3 Answers

14

Using awk you could:

awk '/^\+$/ { delim++ } { file = sprintf("chunk%s.txt", int(delim / 50000)); print >> file; }' < input.txt 

Update:

To not include the delimiter, try this:

awk '/^\+$/ { if(++delim % 50000 == 0) { next } } { file = sprintf("chunk%s.txt", int(delim / 50000)); print > file; }' < input.txt 

The next keyword causes awk to stop processing rules for the current record and advance to the next one (the next line). I also changed the >> to >, since if you run it more than once you probably don't want to append to the old chunk files.
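To see what the second command does on a small scale, here is a sketch that runs it on the four-entry sample from the question with the threshold lowered from 50000 to 2. The file name sample.txt and the variable n are assumptions made for illustration; the logic is otherwise the same.

```shell
# Build the four-entry sample from the question (assumed file name: sample.txt).
printf '%s\n' 'entry 1' 'some more' + 'entry 2' 'some more' 'even more' + \
              'entry 3' 'some more' + 'entry 4' 'some more' + > sample.txt

# Same rules as above, with the threshold passed in as n (2 here, not 50000).
awk -v n=2 '
    /^\+$/ { if (++delim % n == 0) next }    # drop every n-th delimiter line
    { file = sprintf("chunk%s.txt", int(delim / n)); print > file }
' sample.txt
```

With this input, chunk0.txt should hold entries 1 and 2 and chunk1.txt entries 3 and 4, each pair still separated internally by a + line; the second and fourth delimiters are the ones dropped.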

FatalError
  • But this would append each line individually... won't that be incredibly slow because of so much I/O? – cmo Mar 21 '13 at 23:47
  • 2
    From the gawk manual "Redirecting output using `>', `>>', or `|' asks the system to open a file or pipe only if the particular file or command you've specified has not already been written to by your program, or if it has been closed since it was last written to." So it's a bit different than doing it in a shell. – FatalError Mar 21 '13 at 23:51
  • Wow, that is an extremely technical catch. But useful! – cmo Mar 22 '13 at 00:31
  • One final question for bonus points - with this method, the first line in each "chunks" file that is created is the delimiter (`+` above). What if I want NEITHER the first NOR the last line of each file to be a delimiter? (i.e., begin and end "cleanly"). – cmo Mar 22 '13 at 01:41
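One possible way to get chunks that neither begin nor end with the delimiter is to hold each + back until more content for the same chunk arrives. The following is only a sketch and not part of the original answer: sample.txt, the clean%d.txt output names, and n=3 are assumptions chosen for illustration. It also calls close() on each finished chunk, so that awk (which, per the manual excerpt quoted above, keeps redirected files open) has only one output file open at a time.

```shell
# Build the four-entry sample from the question (assumed file name: sample.txt).
printf '%s\n' 'entry 1' 'some more' + 'entry 2' 'some more' 'even more' + \
              'entry 3' 'some more' + 'entry 4' 'some more' + > sample.txt

awk -v n=3 '
    /^\+$/ {
        if (++delim % n == 0) {   # n-th delimiter: this chunk is complete
            close(file)           # release the finished chunk
            pending = 0
        } else {
            pending = 1           # hold the "+" until more content follows
        }
        next
    }
    {
        file = sprintf("clean%d.txt", int(delim / n))
        if (pending) { print "+" > file; pending = 0 }
        print > file
    }
' sample.txt
```

A held-back delimiter that is never followed by content (for example the final + of the input) is simply discarded, so the last chunk also ends cleanly. With the sample above, clean0.txt should contain entries 1 to 3 separated by + lines and clean1.txt only entry 4.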
1

It isn't very hard to do in Perl if you can't find a suitable alternative (and it will perform pretty well):

#!/usr/bin/env perl
use strict;
use warnings;

# Configuration items - could be set by argument handling
my $prefix = "rs.";     # File prefix
my $number = 1;         # First file number
my $width  = 4;         # Number of digits to use in file name
my $rx     = qr/^\+$/;  # Match regex
my $limit  = 3;         # 50,000 in real case
my $quiet  = 0;         # Set to 1 to suppress file names

sub next_file
{
    my $name = sprintf("%s%.*d", $prefix, $width, $number++);
    open my $fh, '>', $name or die "Failed to open $name for writing: $!";
    print "$name\n" unless $quiet;
    return $fh;
}

my $fh = next_file;  # Output file handle
my $counter = 0;     # Match counter
while (<>)
{
    print $fh $_;
    $counter++ if (m/$rx/);
    if ($counter >= $limit)
    {
        close $fh;
        $fh = next_file;
        $counter = 0;
    }
}
close $fh;

That's far from being a one-liner; I'm not sure whether that's a merit or not. The items that should be configured are grouped together, and could be set via command line options, for example. You could end up with an empty last file (when the input ends exactly on a split boundary); you could spot that and remove it if necessary. To do so you'd need a second counter: the existing one is a 'match counter', but you'd also need a line counter, and if the line counter was zero at the end you'd remove the last file. You'd also need to keep the file name to be able to remove it. Fiddly, but not difficult.
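An alternative to tracking the line counter inside the script is a clean-up pass from the shell once the script has finished. This is a minimal sketch under the assumption that only the script's output files match the rs. prefix in the current directory; remove_empty_chunks is a hypothetical helper name, not something the Perl script provides.

```shell
# Hypothetical helper: delete zero-length files matching a prefix (default "rs.").
remove_empty_chunks() {
    prefix=${1:-rs.}
    for f in "$prefix"*; do
        [ -f "$f" ] || continue      # skip if the glob matched nothing
        [ -s "$f" ] || rm -- "$f"    # -s is true only for non-empty files
    done
}

remove_empty_chunks rs.
```

This trades the bookkeeping in the script for a second pass over the file names, which is cheap when there are only a few dozen chunks.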

Given the input (basically two copies of your sample data), the output from repsplit.pl (repeat split) was as shown:

$ perl repsplit.pl data
rs.0001
rs.0002
rs.0003
$ cat data
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
$ cat rs.0001
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
$ cat rs.0002
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
$ cat rs.0003
entry 3
some more
+
entry 4
some more
+
$
Jonathan Leffler
0

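Using Perl with + as the input record separator (-053 is the octal code for +), in a concise "one-liner". Note that this splits on every + character, not only on lines that consist solely of +.

If you'd like to do $_ > newprefix.part.$c, as stated in your comment: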

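$ limit=50000 perl -053 -Mautodie -lne '
    BEGIN { $\ = "" }                    # -l set $\ to "+"; print nothing extra
    s/^\n//;                             # drop the newline left after the previous "+"
    next unless length;                  # skip the empty record after the final "+"
    if ($count++ % $ENV{limit} == 0) {   # first entry of a new chunk
        close $fh if $fh;
        open $fh, ">", "newprefix.part." . ++$c;
    } else {
        print $fh "+\n";                 # separator between entries inside a chunk
    }
    print $fh $_;
' file.txt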

$ ls -l newprefix.part.*


Gilles Quénot