1

I have a large .csv file (2 - 3 million records). I need to concatenate the first three fields (with underscores) and append it to each record, then I need to sort file based on that new field and three other fields. I am able to do that (am testing it with a 4 record file for now) - but I'm not sure how to write it back to the file in the same .csv form - instead of the way Data::Dumper formats each line as a separate variable. Here is the code I have so far - I have a couple of Print (to screen) lines to see what it's doing -

#!/usr/bin/perl/

use strict;
use warnings;
use Data::Dumper;

my $filename = '/testpath/test.csv';
#$filename = 'test.csv';

open my $FH, $filename
  or die "Could not read from $filename <$!>, program halting.";

# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;

my @data;
# Read the lines one by one.
while($line = <$FH>) {

# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
    push @data, \@fields;
}
close $FH;
print "Unsorted:\n", Dumper(@data); #, $/;

@data = sort {
    $a->[0] cmp $b->[0] ||
    $a->[20] cmp $b->[20] ||
    $a->[23] cmp $b->[23] ||
    $a->[26] cmp $b-> [26]
} @data;


open my $OFH, '>', '/testpath/parsedTest.csv';
print $OFH Dumper(@data);
close $OFH;

exit;

I'm assuming it's in the "print $OFH Dumper(@data);" line that I need to re-format it back to its original form.

And please be kind as I am a novice.

__________EDIT__________________________________

Here are the four lines from the test .csv file - first line is the header record:

STORE_NBR,CONTROL_NBR,LINE_NBR,SALES_NBR,QTY_MISTINT,REASON_CODE,MISTINT_COMM,SZ_CDE,TINTER_MODEL,TINTER_SERL_NBR,SPECTRO_MODEL,SPECTRO_SERL_NBR,EMP_NBR,TRAN_DATE,TRAN_TIME,CDS_ADL_FLD,PROD_NBR,PALETTE,COLOR_ID,INIT_TRAN_DATE,GALLONS_MISTINTED,UPDATE_EMP_NBR,UPDATE_TRAN_DATE,GALLONS,FORM_SOURCE,UPDATE_TRAN_TIME,SOURCE_IND,CANCEL_DATE,COLOR_TYPE,CANCEL_EMP_NBR,NEED_EXTRACTED,MISTINT_MQ_XTR,DATA_SOURCE,GUID,QUEUE_NAME,BROKER_NAME,MESSAGE_ID,PUT_TIME,CREATED_TS
1334,53927,1,100551589,1,6,Bad Shercolor Match,16,IFC 8112NP,01DX8005513,,,77,10/23/2015,95816,,OV0020001,,MANUAL,10/21/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,5394A0E67FFF4D01A0D9AD16FA29ABB1,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533133333464EB2956052C0020,10/23/2015 10:45,10/23/2015 10:45
2525,67087,1,650462328,1,4,Tinted Wrong Product,14,IFC 8012NP,Standalone-5,,,11,10/23/2015,104314,,A91W00353,,,10/20/2015,0.25,0,,0.25,,,COMP,,CUSTOM MATCH,0,TRUE,TRUE,O,1AC5D8742D47435EA05343D57372AD32,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533235323531C2295605350020,10/23/2015 10:46,10/23/2015 10:47
1350,163689,1,650462302,1,3,Tinted Wrong Color,14,IFC 8012NP,06DX8006805,,,1,10/23/2015,104907,,A91W00351,COLOR,6233,10/23/2015,0.25,0,,0.5,ENG,,SW,,PALETTE,0,TRUE,TRUE,O,F1A072BCC548412FA22052698B5B0C28,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F53313335307BC12956053C0020,10/23/2015 10:52,10/23/2015 10:52

Hope that's not too convoluted to read.

BigRedEO
  • 807
  • 4
  • 13
  • 33
  • Could you show the 4-record input sample and how you'd like to have it formatted on output? Perhaps we have other ideas on how to achieve that... – PerlDuck Mar 28 '16 at 17:22
  • 4
    You should use a proper CSV module (i.e., [Text::CSV](https://metacpan.org/pod/Text::CSV)) to handle composition and decomposition of CSV data. There are plenty of examples on the page I linked that will show you exactly what you need to do. – Matt Jacob Mar 28 '16 at 17:23
  • @PerlDog - added the four lines above. – BigRedEO Mar 28 '16 at 17:37
  • @MattJacob: There is nothing wrong with `split /,/` if there are no quoted fields – Borodin Mar 28 '16 at 20:16
  • @BigRedEO: and how you'd like to have it formatted on output? – Borodin Mar 28 '16 at 20:17

3 Answers3

3

Data::Dumper outputs a format that is valid perl, and is good for debugging, but not for writing a CSV file. You could write the CSV by hand:

foreach my $row (@data) {
    print $OFG join(',', @$row), "\n";
}

but you really should use a specialized module, in this case Text::CSV, both for reading and writing the CSV – it will handle all the border cases (such as fields with embedded commas).

The synopsis contains a good example of both reading and writing; I won't repeat that here.

mscha
  • 6,509
  • 3
  • 24
  • 40
2

You don't have to rebuild the line if you just store it in @data too!

my @data;
while(my $line = <$FH>) {
    chomp($line);
    my @fields = split(/,/, $line);
    push @data, [ "$line\n", join('_', @fields[0..2]), @fields[19, 22, 25] ];
}

@data = sort {
    $a->[1] cmp $b->[1] ||
    $a->[2] cmp $b->[2] ||
    $a->[3] cmp $b->[3] ||
    $a->[4] cmp $b->[4]
} @data;

print($OFH $_->[0]) for @data;

If your input didn't contain NULs, you could even use the following faster approach:

print $OFH
   map { /[^\0]*\z/g }
   sort
   map {
      chomp;
      my @fields = split /,/;
      join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], "$_\n")
   }
   <$FH>;

But yeah, you should probably use a legit CSV parser.

use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

my @data;
while (my $row = $csv->getline($FH)) {
    push @data, [ join('_', @$row[0..2]), $row ];
}

@data = sort {
    $a->[0]     cmp $b->[0]     ||
    $a->[1][19] cmp $b->[1][19] ||
    $a->[1][22] cmp $b->[1][22] ||
    $a->[1][25] cmp $b->[1][25]
} @data;

$csv->say($OFH, $_->[1]) for @data;

The following is the fast approach using a CSV parser:

use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });

print $OFH
   map { /[^\0]*\z/g }
   sort
   map {
      $csv->parse($_);
      my @fields = $csv->fields();
      join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], $_)
   }
   <$FH>;
ikegami
  • 367,544
  • 15
  • 269
  • 518
0

Was unable to use the Text::CVS_XS because it was not available on our server, unfortunately - but did find adding this single "print" line worked -

open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedTest.csv'; 
print $OFH join(',', @$_), $/ for @data; 
close $OFH; 

Tested out fine with the small file, now to test on the actual file!

BigRedEO
  • 807
  • 4
  • 13
  • 33
  • Not using a (proven) module and writing your own (potentially buggy) code just because it isn't installed on your server, is not the way to do perl development. Find some way to get it installed on your server. – mscha Mar 28 '16 at 20:58