
I have a large file; a sample of it is below:

A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

The file has order header rows (00000) and related order detail rows (00100, 00200, etc.). I want to split the file into chunks of around 40000 lines each, such that each file keeps an order header together with its order details.

I used GNU parallel to split the file into 40000-line chunks, but I have not been able to make the split keep each Order Header together with its related order details while still producing files of around 40000 lines each.

For the above sample file, if I had to split it into around 5 lines each, I would use:

parallel --pipe -N5 'cat > sample_{#}.txt' <sample.txt

But that would give me

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555

sample_2.txt
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

The second Order Header ends up in the first file, while its related order details land in the second.

The desired output would be:

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555
Arpit Singh
  • Your text `making sure that each file has around 40000 lines each` doesn't match your desired output ... – tink Mar 22 '21 at 07:26
  • Does the original file only have one Order header? Or would you like to split on order header when possible? A requirement could be `combine different order headers in one file when the resulting file is <= 40000 lines` or `Split on order header and when the resulting file > 40000, split that file again.`. – Walter A Mar 22 '21 at 09:17
  • @WalterA .. The original file can have any number of Order Header rows, but the file will always have an Order Header row first and then the order detail rows until the next Order Header is encountered, and so on. My requirement is that the file should be split in such a way that each combination of Order Header and related Order details stays in one file – Arpit Singh Mar 22 '21 at 23:58

3 Answers


You may try this code:

( export hdr=$(head -1 sample.txt); parallel --pipe -N4 '{ echo "$hdr"; cat; } > sample_{#}.txt' < <(tail -n +2 sample.txt) )

We keep the header row aside and split the remaining lines, prepending the header to each split file.
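The same header-prepending idea can be sketched with plain coreutils when parallel is not available (a sketch only; the chunk size and file names here are illustrative, not from the answer):

```shell
# Build a tiny sample: one header row followed by detail rows.
printf '%s\n' 'A222, 00000, 555' 'A222, 00100, 555' \
              'A222, 00200, 555' 'A222, 00300, 555' > sample.txt

hdr=$(head -n 1 sample.txt)                  # keep the header row aside
tail -n +2 sample.txt | split -l 2 - chunk_  # split the rest into 2-line chunks

# Prepend the saved header to every chunk.
for f in chunk_*; do
  { echo "$hdr"; cat "$f"; } > "with_hdr_${f#chunk_}.txt"
  rm "$f"
done
```

Note that, like the parallel one-liner above, this replicates only the first header into every file, which matches this answer's reading of the question rather than the multi-header case clarified in the comments.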

anubhava
  • The OPs sample output doesn't quite match his description ... each output should have around 40000 lines? – tink Mar 22 '21 at 07:25
  • 1
    That can be done by changing `-N4` to `-N40000` – anubhava Mar 22 '21 at 07:26
  • @anubhava I think I was not able to explain the problem. This is a sample sales order files, that have order header(00000) in one row, and the subsequent rows that follows are order details(00100,00200,00500,00900.. and so on), before the next Order Header is encountered, and then again the order details for the Header follows. So, I would like to have ~40K lines each keeping in mind that the combinations of Order Header and details are together in one file – Arpit Singh Mar 22 '21 at 23:55
  • ok then please clarify how `sample2.txt` expected output shows 2 records of `00000` ? – anubhava Mar 23 '21 at 13:42

Single record:

cat file | parallel --pipe --recstart 'A222, 00000, 555' -n1 'echo Single record;cat'

Multiple records (up to `--block-size`):

cat file | parallel --pipe --recstart 'A222, 00000, 555' --block-size 100 'echo Multiple records;cat'

If 'A222' does not stay the same:

cat file | parallel -k --pipe --regexp --recstart '[A-Z]\d+, 00000' -N1 'echo Single record;cat'
Ole Tange

When each Order Header has a lot of detail records, you might consider the simple

csplit -z sample.txt '/00000,/' '{*}'

This will make a file for each Order Header. It ignores the ~40K requirement and might produce a very large number of files, so it is only viable when you have a limited number (perhaps 40?) of different Order Headers.
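As a quick sanity check of that behaviour (assuming GNU csplit from coreutils), running it on the question's sample should produce one file per Order Header, each file starting with a header line:

```shell
# Recreate the question's 8-line sample (3 Order Headers).
printf '%s\n' \
  'A222, 00000, 555' 'A222, 00100, 555' 'A222, 00200, 555' 'A222, 00300, 555' \
  'A222, 00000, 555' 'A222, 00100, 555' \
  'A222, 00000, 555' 'A222, 00200, 555' > sample.txt

# -z elides the empty leading piece, -s silences the size report;
# output files are named xx00, xx01, ... by default.
csplit -z -s sample.txt '/00000,/' '{*}'
```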

When you do want different headers combined in a file, consider

awk -v max=40000 '
   # Write the buffered record group; open a new output file when the
   # current one would exceed max lines (or when none is open yet).
   function flush() {
      if (last+nr>max || sample==0) {
         outfile="sample_" sample++ ".txt";
         last=0;
      }
      for (i=0;i<nr;i++) print a[i] >> outfile;
      last+=nr;     # lines written to the current file so far
      nr=0;         # reset the buffer
   }
   BEGIN { sample=0 }
   /00000,/ { flush(); }  # an Order Header line starts a new group
   {a[nr++]=$0}           # buffer every line of the current group
   END { flush() }        # write out the final group
   ' sample.txt
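To check the script against the question's sample, one can lower max to 5; it should then reproduce the desired two-file split, with each file starting on an Order Header:

```shell
# Recreate the question's 8-line sample (3 Order Headers).
printf '%s\n' \
  'A222, 00000, 555' 'A222, 00100, 555' 'A222, 00200, 555' 'A222, 00300, 555' \
  'A222, 00000, 555' 'A222, 00100, 555' \
  'A222, 00000, 555' 'A222, 00200, 555' > sample.txt

# Same script as above, with max lowered to 5 for the small sample.
awk -v max=5 '
   function flush() {
      if (last+nr>max || sample==0) {
         outfile="sample_" sample++ ".txt";
         last=0;
      }
      for (i=0;i<nr;i++) print a[i] >> outfile;
      last+=nr;
      nr=0;
   }
   BEGIN { sample=0 }
   /00000,/ { flush(); }
   {a[nr++]=$0}
   END { flush() }
   ' sample.txt
```

The first header's group (4 lines) goes to sample_0.txt; adding the second group would exceed 5 lines, so the remaining two groups go to sample_1.txt.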
Walter A
  • Thanks for the solution. Is it possible to have all the files start with an Order Header line, i.e. a line having 00000, while keeping the ~40K condition? – Arpit Singh Mar 23 '21 at 11:38
  • Both the `awk` and the `csplit` will start new files when the line matches `00000,`. You should only make sure that `sample.txt` starts with an Order Header (or change the solution). – Walter A Mar 23 '21 at 12:44