
I have two pipe-delimited files.

The first file has around 10 columns, but I am only interested in the first two, which are used to update a column value in the second file.

First file:

1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9

Second file:

a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**1**|a10|a11|a12
a1|a2|**ray**|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|**kate**|a3|a4|a5|a6|a7|a8|**20**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**6**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**45**|a10|a11|a12

My requirement is to find the unique values in the 3rd column and also to replace the 4th column from the last. That column may or may not contain a number. When it does, the number also appears in the first field of the first file, and I need to replace it (in the second file) with the corresponding value from the second column of the first file.

expected output:

unique string : ray kate bob

a1|a2|bob|a3|a4|a5|a6|a7|a8|**alpha**|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|**charlie**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|**romeo**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

I am able to pick the unique strings using the command below:

awk -F'|' '{a[$3]++}END{for(i in a){print i}}' filename

I don't want to read the second file twice (once to pick the unique strings and again to replace the 4th column from the last), because the files are huge: around 500 MB each, and there are many such files.

Currently I am using Perl's Text::CSV module to read the first file (which is small) and load its first two columns into a hash, with the first column as the key and the second as the value. I then read the second file and replace the 4th-from-last column with the hash value. But this is time-consuming, as Text::CSV parsing seems to be slow.

Any awk/perl solution keeping speed in mind would be really helpful :)

Note: Ignore the ** asterisks around the text; they are just for highlighting and are not part of the data.

UPDATE: Code

#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ sep_char => '|' });

my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";

open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
    chomp $line;

    if ($csv->parse($line)) {

        my @fields = $csv->fields();
        $hash{$fields[0]} = $fields[1];

    } else {
        warn "Line could not be parsed: $line\n";
    }
}
close($data);

my $csv = Text::CSV->new({ sep_char => '|' , blank_is_undef => 1 , eol => "\n"});
my $file2 = $ARGV[1] or die "Need to get CSV file on the command line\n";

open ( my $fh,'>','/tmp/outputfile') or die "Could not open file $!\n";
open(my $data2, '<', $file2) or die "Could not open '$file2' $!\n";
while (my $line = <$data2>) {
    chomp $line;

    if ($csv->parse($line)) {

        my @fields = $csv->fields();
        # Replace only when the value is numeric and has a mapping in the hash,
        # so unmapped numbers (e.g. 45) are left untouched.
        if (defined ($fields[-4]) && looks_like_number($fields[-4])
            && exists $hash{$fields[-4]})
        {
            $fields[-4] = $hash{$fields[-4]};
        }

        $csv->print($fh,\@fields); 
    } else {
        warn "Line could not be parsed: $line\n";
    }
}
close($data2);
close($fh);
  • Re. "Text::CSV parsing seems to be slow." Are you sure the issue is with Text::CSV and not with your code? (you haven't shown any so I can't tell) – ThisSuitIsBlackNot Mar 31 '14 at 18:04
  • Also, if Text::CSV really *is* the bottleneck, installing [Text::CSV_XS](https://metacpan.org/pod/Text::CSV_XS) should give you a performance boost. If Text::CSV_XS is installed, `use Text::CSV;` will use the XS version automatically. – ThisSuitIsBlackNot Mar 31 '14 at 18:07
  • You need to print out any unique values from the third column but update the fourth-to-last column in any row where it matches a value from the first column in the first file? Or only update rows which give you a unique value for the third column? – Etan Reisner Mar 31 '14 at 18:15
  • The 3rd column and the 4th-from-last column are not dependent on each other. I need to read the file once but perform two operations at the same time: pick the unique strings from the 3rd column and replace the 4th-from-last column. – chidori Mar 31 '14 at 18:26
  • @ThisSuitIsBlackNot I have updated my perl code using `Text::CSV` – chidori Mar 31 '14 at 18:27
  • The typical usage for reading from a file with `Text::CSV` is to use `getline`, not `parse`: `while (my $row = $csv->getline($fh)) { ... }`. In this case, `$row` is an array reference, so you can access the first field with `$row->[0]` and get all fields with `@$row`. I haven't benchmarked it, so not sure if that would make a significant speed difference or not. – ThisSuitIsBlackNot Mar 31 '14 at 18:52
  • If you set the `auto_diag` option to the `new` method, an error will be displayed automatically if a line fails to parse with `getline`, so you don't have to add a bunch of `or die ...` statements: `my $csv = Text::CSV->new({ sep_char => '|', auto_diag => 1 });` – ThisSuitIsBlackNot Mar 31 '14 at 18:56
  • You know you have defined two `$csv` objects? With different options. – TLP Mar 31 '14 at 19:14
  • I threw together a rough benchmark and using `getline` is significantly faster than `parse`. – ThisSuitIsBlackNot Mar 31 '14 at 19:30
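
For reference, a rough benchmark along the lines of that last comment might look like the sketch below (the sample row and the Benchmark settings are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Text::CSV;

# Build some in-memory sample data so no files are needed.
my @lines = ("a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12\n") x 10_000;
my $data  = join '', @lines;

cmpthese(-3, {
    parse => sub {
        my $csv = Text::CSV->new({ sep_char => '|' });
        for my $line (@lines) {
            chomp(my $copy = $line);
            $csv->parse($copy) or die $csv->error_diag;
            my @fields = $csv->fields();
        }
    },
    getline => sub {
        my $csv = Text::CSV->new({ sep_char => '|' });
        open my $fh, '<', \$data or die $!;    # read from the in-memory string
        while (my $row = $csv->getline($fh)) { }
        close $fh;
    },
});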

3 Answers


Here's an option that doesn't use Text::CSV:

use strict;
use warnings;

@ARGV == 3 or die 'Usage: perl firstFile secondFile outFile';

my ( %hash, %seen );
local $" = '|';    # list separator, so "@F" interpolates with pipes between fields

# First pass: build the lookup hash from the first file's first two columns.
while (<>) {
    my ( $key, $val ) = split /\|/, $_, 3;    # limit of 3: only the first two fields matter
    $hash{$key} = $val;
    last if eof;                              # eof of the first file ends this loop
}

open my $outFH, '>', pop or die $!;    # last ARGV element is the output file

# Second pass: collect unique third-column values and remap the 4th-from-last field.
while (<>) {
    my @F = split /\|/;
    $seen{ $F[2] } = undef;
    $F[-4] = $hash{ $F[-4] } if exists $hash{ $F[-4] };
    print $outFH "@F";    # newline survives because the last field keeps its "\n"
}

close $outFH;

print 'unique string : ', join( ' ', reverse sort keys %seen ), "\n";

Command-line usage: perl firstFile secondFile outFile

Contents of outFile from your datasets (asterisks removed):

a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

STDOUT:

unique string : ray kate bob

Hope this helps!

Kenosis
  • 6,196
  • 1
  • 16
  • 16

Use getline instead of parse; it is much faster. The following is a more idiomatic way of performing this task. Note that you can reuse the same Text::CSV object for multiple files.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Text::CSV;

my $csv = Text::CSV->new({
    auto_diag      => 1,
    binary         => 1,
    blank_is_undef => 1,
    eol            => $/,
    sep_char       => '|'
}) or die "Can't use CSV: " . Text::CSV->error_diag;

open my $map_fh, '<', 'map.csv' or die "map.csv: $!";

my %mapping;
while (my $row = $csv->getline($map_fh)) {
    $mapping{ $row->[0] } = $row->[1];
}

close $map_fh;

open my $in_fh, '<', 'input.csv' or die "input.csv: $!";
open my $out_fh, '>', 'output.csv' or die "output.csv: $!";

my %seen;
while (my $row = $csv->getline($in_fh)) {
    $seen{ $row->[2] } = 1;

    my $key = $row->[-4];
    $row->[-4] = $mapping{$key} if defined $key and exists $mapping{$key};
    $csv->print($out_fh, $row);
}

close $in_fh;
close $out_fh;

say join ',', keys %seen;

map.csv

1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9

input.csv

a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|20|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|6|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

output.csv

a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

STDOUT

kate,bob,ray
  • Thanks, I am able to get the desired result, but when it comes to speed, the traditional split approach seems to be faster. The Text::CSV module is slow, though it gives greater control over the CSV file. – chidori Apr 01 '14 at 17:32
  • @chidori `split` will be faster, but if there's any possibility that a field could contain the delimiter (now or sometime in the future), `Text::CSV` is the way to go. – ThisSuitIsBlackNot Apr 01 '14 at 17:37
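
To illustrate that last point: if a field is ever quoted and contains the delimiter, a naive split miscounts the fields, while Text::CSV parses them correctly. A minimal sketch (the sample line is made up):

use strict;
use warnings;
use Text::CSV;

my $line = q{a1|"value|with|pipes"|a3};

# split treats every pipe as a separator, even the ones inside the quotes.
my @by_split = split /\|/, $line;       # 5 fields

# Text::CSV honors the quoting and returns 3 fields.
my $csv = Text::CSV->new({ sep_char => '|' });
$csv->parse($line) or die $csv->error_diag;
my @by_csv = $csv->fields();            # a1, value|with|pipes, a3

print scalar @by_split, " vs ", scalar @by_csv, "\n";   # prints "5 vs 3"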

This awk should work.

$ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ unique[$3]++ }
{ $(NF-3) = ($(NF-3) in a) ? a[$(NF-3)] : $(NF-3) } 1
END {
    for (n in unique) print n > "unique.txt"
}' file1 file2 > output.txt

Explanation:

  • We set the input and output field separators to |.
  • We iterate through the first file, creating an array that stores column one as the key and column two as the value.
  • Once the first file is loaded in memory, we build another array while reading the second file. This array stores the unique values from column three of the second file.
  • While reading the second file, we check whether the fourth value from the last is present in the array built from the first file. If it is, we replace it with the value from the array; if not, we leave the existing value as is.
  • In the END block we iterate through our unique array and print it to a file called unique.txt. This holds all the unique entries seen in column three of the second file.
  • The entire output of the second file is redirected to output.txt, which now has the modified fourth column from the last.

$ cat output.txt
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12

$ cat unique.txt
kate
bob
ray
  • Thanks, it's good, but I need to put it within Perl code; if I backslash all the $ symbols, Perl still complains. Can you help me put this inside Perl code? – chidori Apr 01 '14 at 17:33
  • @chidori I wouldn't use `awk` inside `perl`. `perl` is powerful enough to do all of this quite efficiently. This was just an alternate solution. – jaypal singh Apr 01 '14 at 18:53
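
If you really did want to embed the awk program in Perl despite that advice, a single-quoted heredoc sidesteps the $-interpolation problem, since Perl does not interpolate inside <<'AWK'. A rough sketch under that assumption (file names as in the answer above):

#!/usr/bin/perl
use strict;
use warnings;

# A single-quoted heredoc stops Perl from touching the $ signs in the awk code.
my $awk_prog = <<'AWK';
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ unique[$3]++ }
{ $(NF-3) = ($(NF-3) in a) ? a[$(NF-3)] : $(NF-3) } 1
END { for (n in unique) print n > "unique.txt" }
AWK

# The program text contains no single quotes, so it can be wrapped in them
# for the shell; output redirection is handled by the shell as before.
system("awk '$awk_prog' file1 file2 > output.txt") == 0
    or die "awk failed: $?";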