How to optimally move lines with a specific pattern at the top in a huge file using Perl?

Question

I have a huge csv file of nearly 20k rows with below format:

file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

I need to put 2 lines of pattern with same syntax (i.e 4th column) at the top. And then the rest of the lines will be there as is. That means first two lines with syntax as 'perl', then followed by 'java' , 'python' etc that way.

I have so far written this below code using seek and tell to make it optimized. However, it is not working as expected.

use strict;
use warnings;

open(FP, "+<mycsv.csv");

my %hash = ();
my $cur_pos;    


while(<FP>) {

    my $line = $_;
    chomp $line;
    #print "$line aaa\n";
    if($line =~ /^file\,tools,/) {next;}

    if($line =~ /^\w+\,\w+\,\w+,(\w+)\,.*$/) {
        my $type = $1;
        #print "type $type\n";

    if($hash{$type}->{count} < 2 ) {
        #print "--- here type = $type | lastpos = ", $hash{$type}->{lastpos} , "\n";
        $cur_pos = tell(FP);
        my $pos = tell(FP) - length($line); 
        if($hash{$type}->{lastpos} ) {

            my $lastpos = $hash{$type}->{lastpos};
            seek(FP, $lastpos, 1);
            print FP $line;
            seek(FP, $cur_pos, 1);
        } 

        $hash{$type}->{lastpos} = $pos;


    }
        if(exists $hash{$type} ) {
            $hash{$type}->{count} += 1;
        } else {
            $hash{$type}->{count} = 1;
        }


    }
}


close(FP);

The expected output should look like below:

 file,tools,edit,syntax,buffers
    a,b,c,perl,d
    a,e,c,perl,d
    a,w,c33,java,d
    wa,b,c33,java,d
    a,s,c,python,d1
    a,f,c,python,dd
    a,n,c,php,d3
    wa,b,c33,php,d
    d,r,hhh,cpp,d0
    d,buuu,hhh,cpp,d0
    d,m,hhh,c#,d0
    wa,b,c33,c#,d
    a,o,c,pdf,d3 
    a,yb,c,c,ddf 
    d44,b,hhh,nlp,d0
    a,be,c,js,d4  
    a,h,c,perl,dg   
    a,b,c,perl,dt   
    wa,b,c33,java,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,php,d
    wa,b,c33,python,d
    wa,b,c33,perl,d
    wa,b,c33,php,d
    wa,b,c33,java,d
    wa,b,c33,python,d

Any help to make it work would be much appreciated.

Thanks.

What is the output supposed to look like from the input you've provided? — i alarmed alien, Apr 16 '18 at 09:02
@ialarmedalien Hi, please see that I have added the output. First two lines of each syntax type are clubbed together and put at the top. Followed by the rest of the lines. In this example I have tried to do it for two line for each type, it could be any fixed number (3,4,5). I believe if the moving of lines can be done for 2 lines, doing it for 3,4 or 5 lines should be same. Please let me know if the requirement is not clear. Please ignore the space at the beginning which got added while copy/pasting. The indentation should remain same as the input. — A.G.Progm.Enthusiast, Apr 16 '18 at 09:13

score 2 · Answer 1 · answered Apr 16 '18 at 10:15

2

I'm getting little different output than yours for the same logic. Can you please go through this output and let me know if any change is required? Approach is mentioned inline with the comments.

use strict;
use warnings;
use feature 'say';
my $syntax = [];
my $NUM = 2;   # change number if needed
my $filename = 'file.txt';
my $data = {};  # make a hash of data

open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
    chomp $row;
    next if $. == 1; # skip header row
    my @columns = split (',', $row);

    push @$syntax, $columns[3];   # make a list of all syntaxes available
    push @{$data->{$columns[3]}}, $row;
}
close $fh;

my $processed = {};
# loop throught the syntax array and print data from hash
# also, make a counter of the number of times that syntax is used.
# it will help us to skip next (n-1) occurence of that syntax
for my $syntax (@$syntax) {
    if (!$processed->{$syntax}){
        for my $s (splice @{$data->{$syntax}}, 0, $NUM) {
            $processed->{$syntax} += 1;
            say $s;
        }
    } else {
        $processed->{$syntax} -= 1;
    }
}
# print out the remaining values
for my $rem (values %$data){
    say for @$rem;    
}

Output:

a,b,c,perl,d
a,e,c,perl,d
a,w,c33,java,d
wa,b,c33,java,d
a,s,c,python,d1
a,f,c,python,dd
a,n,c,php,d3
wa,b,c33,php,d
d,r,hhh,cpp,d0
d,buuu,hhh,cpp,d0
d,m,hhh,c#,d0
wa,b,c33,c#,d
a,o,c,pdf,d3
a,yb,c,c,ddf
a,h,c,perl,dg
a,b,c,perl,dt
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,perl,d

answered Apr 16 '18 at 10:15

Kamal Nayan

1,890
21
34

I think your output is slightly different. As 'nlp' and 'js' are also a different 'syntax' type (although they are occurring only once, still) hence they should have been moved up after the line 'a,yb,c,c,ddf'. Rest of the line should be written as per original sequence, they need not to be written as per the sequence of same syntax one after another. Lastly, the lines need to be moved up in the same file instead of printing them on the terminal and the first line with column names should also be there in the csv file as is. – A.G.Progm.Enthusiast Apr 16 '18 at 10:57
I have made it on the basis of sequece in the input, say `perl, java, perl, python, perl, python, perl, nlp, python, python` then as per sequence it would be `perl, perl, java, perl (skip), python, python, perl, perl, python (skip), perl (skip), nlp, python, python` i.e. `perl, perl, java, python, python, perl, perl, nlp, python, python`. Please correct if I have misunderstood something. – Kamal Nayan Apr 16 '18 at 11:23
And for writing in same file, read whole file in a array `@data = <$fh>`, loop over it and then open same file in write mode. I'ld suggest to write in a seperate file, rather than reading whole file in an array, unless its a requirement. I'll come up with any more efficient way, if anything strikes my mind. – Kamal Nayan Apr 16 '18 at 11:29
well in this comment the sequence example you have given is correct, but I am not sure whether your code is generating as per that sequence. Never mind. I will see if I can do that. Yes, efficiency I need because it a huge text file so parsing the whole in array or using another file is not what I want. Thanks. – A.G.Progm.Enthusiast Apr 16 '18 at 11:37
My code is not functioning correctly ..so is the problem. I have included the expected output which my code is not able to produce. – A.G.Progm.Enthusiast Apr 16 '18 at 11:42

score 2 · Accepted Answer · answered Apr 16 '18 at 12:10

I'd approach this by parsing the file to collect those first pairs of lines in a data structure and sending the other lines to a temp file. Once you've finished parsing the file, print out the pairs of lines from the data structure into your output file, and then add the temp file on to the end of the output file.

sample code:

use strict;
use warnings;
use feature ':5.16';

my $infile = 'infile';
my $outfile = 'outfile';
my $tempfile = 'temp';
my $quantity = 2;  # or whatever

open my $in, '<', $infile or die 'Could not open infile: ' . $!;
open my $out, '>', $outfile or die 'Could not create output file: ' . $!;
open my $temp, '>', $tempfile or die 'Could not create tempfile: ' . $!;

my $hash = {};
my @order;
my $hdr;

while ( <$in> ) {
  if ( $hdr ) {
    my @cols = split ",", $_;
    my $key = $cols[3];

    # have we seen this key before?
    if ( ! $hash->{$key} ) {
      push @order, $key;
      $hash->{$key} = [ $_ ];
    }
    elsif ( scalar @{$hash->{$key}} < $quantity ) {
      push @{$hash->{$key}}, $_;
    }
    else {
      print { $temp } $_;
    }
  }
  else {
    # the header line
    print { $out } $_;
    $hdr = $_;
  }
}

# print the collected twofers out into the tempfile
for my $key ( @order ) {
  print { $out } @{$hash->{$key}};
}
close $out;
close $temp;

# concatenate the files
system join ' ', ( 'cat', $tempfile, '>>', $outfile );

If the paired lines don't have to be in the order that they appear in the source file, you can skip the @order stuff.

@i alarmed alien can't we do it without opening multiple files. In the same file , is it possible to move lines up and down ? If yes, please let me know or point me to something which shows how to do it. Thanks for the post. — A.G.Progm.Enthusiast, Apr 17 '18 at 08:39
It is possible, yes, but IMO much easier to create a temp file and then delete it. You can also do the whole thing in memory—instead of printing to a temp file, push lines on to an array, which you print out once you've gone through the input file and found the first occurrences of each type. There are lots of different ways to handle this task; I'm not a fan of rewriting files in place as if something goes wrong, you lose your original file. Creating a new output file leaves your input file untouched if it needs to be used for something else (or if you find a bug in your code!). — i alarmed alien, Apr 17 '18 at 12:06

Sinan Ünür · Answer 3 · 2018-04-16T12:55:05.160

I have a huge CSV file of nearly 20k rows with below format:

That is not huge by any stretch. The file size is probably about a megabyte.

While I normally recommend line-by-line processing to ensure robustness with respect to file size, in this case, you know that the files you are dealing with are small. The question is whether the time you spend optimizing this thing is worth it.

If I understand you correctly, your problem can be solved quickly (in programmer time) by wasting some memory:

#!/usr/bin/env perl

use strict;
use warnings;
use List::Util qw( uniqstr );

my $TOP = 2;

(my $header = <DATA>) =~ s/\s+\z//;
my @header = split /,|\s+/, $header;
my %idx = map +($header[$_] => $_), 0 .. $#header;

my @lines = grep /\S/, <DATA>;
my %syntax_of = map +($_ => (split /,/, $_)[$idx{syntax}]), @lines;

my @syntaxes = uniqstr map $syntax_of{$_}, @lines;

my %lines_of;
for my $n (0 .. $#lines) {
    push @{$lines_of{$syntax_of{$lines[$n]}}}, $n;
}

print "$header\n";

for my $syntax (@syntaxes) {
    my @top = grep defined, map $lines_of{$syntax}->[$_ - 1], 1 .. $TOP;
    print @lines[@top];
    # normally, invoking delete on an array slice is not
    # but it is just what we need here.
    delete @lines[@top];
}

print grep defined, @lines;

__DATA__
file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

PS: See also Tie::File

PPS: At first sight, there are at least six things one might be inclined to tweak here if one wanted to spend time on it.

%idx = map +($header[$_] => $_), 0 .. $#hdr , I understood what it is doing but i am more familiar with map {$hash{$_}=> 'something'} @arr kind of syntax. If you explain the map+() thing would be great. uniqstr is not available to me in List::Util, so had to do it by using a hash. Maybe 20k is not huge, but it may go up to 2000k , so I need an optimized way. You did not show how to write to the file, should I do it with tie::file? or I can simply print those lines in another file. But not sure about the optimization part. Lastly, thanks for the code. — A.G.Progm.Enthusiast, Apr 17 '18 at 07:43
"*Maybe 20k is not huge, but it may go up to 2000k*" ... Now, you are changing problem parameters. Still 2 million rows is only about 160 MB, still no where near "huge". If you had 160 GB, then you are getting into largish file sizes. The trade-off between multiple passes over a file versus an array is not obvious, and you'd have to measure to see which one performs better. I used `uniquestr` because it preserves the order of appearance in the data file (which seemed to be one of your criteria), one could have also utilized `Tie::IxHash`. `map expr, list` is faster than `map block list`. — Sinan Ünür, Apr 17 '18 at 10:35
If I use Tie::File and read the entire file in `@arr` at first , then I use `@arr` instead of `@lines` in the line `my %syntax_of = map +($_ => (split /,/, $_)[$idx{syntax}]), @lines;`. In that case it splits the header and any blank line again. Since you are reading the header at first and the rest of the lines as `my @lines = grep /\S/, ;` so it is working for you. How do I take care of this using Tie::File. Or instead of using map , I do them with loop and keep doing next when I get headers or blank lines? — A.G.Progm.Enthusiast, Apr 18 '18 at 15:49

How to optimally move lines with a specific pattern at the top in a huge file using Perl?

3 Answers3