
Although this is pretty basic, I can't find a similar question, so please link to one if you know of an existing question/solution on SO.


I have a .txt file that is about 2MB and about 16,000 lines long. Each record length is 160 characters with a blocking factor of 10. This is an older type of data structure which almost looks like a tab-delimited file, but the separation is by single-chars/white-spaces.

First, I glob a directory for .txt files - there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.

my $txt_file = glob "/some/cheese/dir/*.txt";

Then I open the file with this line:

open (F, $txt_file) || die ("Could not open $txt_file");

As per the data dictionary for this file, I'm parsing each "field" out of each line using Perl's substr() function within a while loop.

while ($line = <F>)
{
$nom_stat   = substr($line,0,1);
$lname      = substr($line,1,15);
$fname      = substr($line,16,15);
$mname      = substr($line,31,1);
$address    = substr($line,32,30);
$city       = substr($line,62,20);
$st         = substr($line,82,2);
$zip        = substr($line,84,5);
$lnum       = substr($line,93,9);
$cl_rank    = substr($line,108,4);
$ceeb       = substr($line,112,6);
$county     = substr($line,118,2);
$sex        = substr($line,120,1);
$grant_type = substr($line,121,1);
$int_major  = substr($line,122,3);
$acad_idx   = substr($line,125,3);
$gpa        = substr($line,128,5);
$hs_cl_size = substr($line,135,4);
}


This approach takes a lot of time to process each line and I'm wondering if there is a more efficient way of getting each field out of each line of the file.

Can anyone suggest a more efficient/preferred method?

CheeseConQueso
  • See http://stackoverflow.com/questions/1083269/is-perls-unpack-ever-faster-than-substr for some relevant benchmarks. – mob Mar 02 '11 at 22:07
  • See http://stackoverflow.com/q/5083436#comment-5695536 for mob's list of dupes. – daxim Mar 02 '11 at 22:11

4 Answers


It looks to me like you are working with fixed-width fields here. Is that true? If so, the unpack function is what you need. You provide a template describing the fields and it extracts the data from them. There is a tutorial available, and the template syntax is described in the documentation for pack, which is unpack's logical inverse. As a basic example:

my @values = unpack("A1 A15 A15 ...", $line);

where 'A' means a text string (as I understand it; trailing spaces are stripped when unpacking) and the number is the field width in characters. There is quite an art to unpack as some people use it, but I believe this will suffice for basic use.
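As a rough sketch of how the full template could be derived from the offsets in the question, the snippet below builds it from a name/offset/length table. The names `@layout` and `parse_record` and the generated sample line are made up for illustration. It uses `@N` to jump to an absolute offset (which covers the gaps the substr() calls skip) and lowercase `a` so that padding is kept exactly as substr() would return it; swap in `A` to strip trailing spaces.

```perl
use strict;
use warnings;

# Field layout copied from the substr() calls in the question:
# name, zero-based offset, length. (@layout is a hypothetical name.)
my @layout = (
    [nom_stat   => 0,   1],  [lname      => 1,   15], [fname    => 16,  15],
    [mname      => 31,  1],  [address    => 32,  30], [city     => 62,  20],
    [st         => 82,  2],  [zip        => 84,  5],  [lnum     => 93,  9],
    [cl_rank    => 108, 4],  [ceeb       => 112, 6],  [county   => 118, 2],
    [sex        => 120, 1],  [grant_type => 121, 1],  [int_major => 122, 3],
    [acad_idx   => 125, 3],  [gpa        => 128, 5],  [hs_cl_size => 135, 4],
);

# '@N' moves to absolute offset N; 'aN' takes N characters unchanged.
my $template = join ' ', map { sprintf '@%d a%d', $_->[1], $_->[2] } @layout;
my @names    = map { $_->[0] } @layout;

# Hypothetical helper: returns a hash ref of field name => value.
sub parse_record {
    my ($line) = @_;
    my %rec;
    @rec{@names} = unpack $template, $line;
    return \%rec;
}

# Made-up 160-character sample line (A..Z repeated) to try it on.
my $sample = join '', map { chr(ord('A') + $_ % 26) } 0 .. 159;
my $rec    = parse_record($sample);
print "lname = $rec->{lname}\n";
```

Note that unpack dies if `@N` points past the end of the string, so a short line would need to be checked or padded first.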

Joel Berger
  • @daxim, thanks, I hope I used it correctly, I don't have much experience in writing templates for it. – Joel Berger Mar 02 '11 at 22:08
  • Thanks Joel. Even though the thread that mob suggested seems to show that substr is better, that might hold only in certain contexts. unpack is new to me, and it seems logical that it would be preferred because these are indeed fixed-length fields. I'll try it out. Thanks – CheeseConQueso Mar 03 '11 at 04:13
  • substr is not faster than unpack. Did you read the whole benchmark post? – socket puppet Mar 03 '11 at 04:22
  • Now I did. I'm running the unpack method now; we'll see how it goes. I don't know why this script is taking over 2 hours to finish. The dirty validation SQL at the end doesn't take more than a few seconds – CheeseConQueso Mar 03 '11 at 04:45
  • @CheeseConQueso: I included `Benchmark`, and an example of how to use it, in my answer so it'd be instructive. To understand why your program is taking a long time to run, [`Benchmark`](http://perldoc.perl.org/Benchmark.html) and [`Devel::DProf`](http://perldoc.perl.org/Devel/DProf.html) are invaluable tools in your Perl arsenal. – Ian C. Mar 03 '11 at 14:26

A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:

         Rate unpack substr regexp
 unpack 2.59/s     --   -59%   -67%
 substr 6.23/s   141%     --   -21%
 regexp 7.90/s   206%    27%     --

Input was a file with 20k lines, each containing the same 160 characters (16 repetitions of the characters 0123456789). So it's roughly the same input size as the data you're working with.

The Benchmark::cmpthese() function lists the subroutines from slowest to fastest. The first column tells us how many times per second each subroutine can be run. The regular expression approach is fastest, not unpack as I stated previously. Sorry about that.

The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.

#!/usr/bin/env perl
use Benchmark qw(:all);
use strict;

sub use_substr() {
    print "use_substr(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = (substr($line,0,1),
                           substr($line,1,15),
                           substr($line,16,15),
                           substr($line,31,1),
                           substr($line,32,30),
                           substr($line,62,20),
                           substr($line,82,2),
                           substr($line,84,5),
                           substr($line,93,9),
                           substr($line,108,4),
                           substr($line,112,6),
                           substr($line,118,2),
                           substr($line,120,1),
                           substr($line,121,1),
                           substr($line,122,3),
                           substr($line,125,3),
                           substr($line,128,5),
                           substr($line,135,4));
       #print "use_substr(): \$lname = $lname\n";
       #print "use_substr(): \$gpa   = $gpa\n";
    }    
    close(F);
    return 1;
}

sub use_regexp() {
    print "use_regexp(): New iteration\n";
    my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(.{9})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(.{4})';
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        if ( $line =~ m/$pattern/o ) {
            my($nom_stat, 
               $lname,   
               $fname,      
               $mname,    
               $address,     
               $city,    
               $st,       
               $zip,         
               $lnum,        
               $cl_rank,
               $ceeb,    
               $county,
               $sex,     
               $grant_type,
               $int_major, 
               $acad_idx,  
               $gpa,   
               $hs_cl_size) = ( $1,
                                $2,
                                $3,
                                $4,
                                $5,
                                $6,
                                $7,
                                $8,
                                $9,
                                $10,
                                $11,
                                $12,
                                $13,
                                $14,
                                $15,
                                $16,
                                $17,
                                $18);
            #print "use_regexp(): \$lname = $lname\n";
            #print "use_regexp(): \$gpa   = $gpa\n";
        }
    }    
    close(F);
    return 1;
}

sub use_unpack() {
    print "use_unpack(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = unpack(
               "(A1)(A15)(A15)(A1)(A30)(A20)(A2)(A5)(A9)(A4)(A6)(A2)(A1)(A1)(A3)(A3)(A5)(A4)(A*)", $line
               );
        #print "use_unpack(): \$lname = $lname\n";
        #print "use_unpack(): \$gpa   = $gpa\n";
    }
    close(F);
    return 1;
}

# Benchmark it
my $itt = 50;
cmpthese($itt, {
        'substr' => sub { use_substr(); },
        'regexp' => sub { use_regexp(); },
        'unpack' => sub { use_unpack(); },
    }
);
exit(0)
Ian C.

Do a split on each line, like this:

my @values = split(/\s/,$line);

and then work with your values.

Geo
  • Furthermore, it will only work if the data is space-separated. Since the OP takes one character from position zero, then 15 characters from position 1, then 15 characters from position 16, it doesn't appear to be, unless my math is incorrect. – Joel Berger Mar 02 '11 at 22:06
  • I thought about a split, but thought that it was only used with a fixed-length break or a specific character as its delimiter. I think the breaks between fields are too varied for split to work unless it's wrapped in some other logical test – CheeseConQueso Mar 03 '11 at 04:20
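To illustrate the point made in the comments, here is a small sketch with a made-up three-field record (the widths and values are hypothetical, not taken from the OP's data dictionary). Because adjacent fields are not separated and values can contain spaces, split cuts in the wrong places, while unpack with the known widths recovers the fields.

```perl
use strict;
use warnings;

# Hypothetical record: 1-char status, 15-char last name, 30-char address,
# padded with spaces the way a fixed-width file would be.
my $line = pack 'A1 A15 A30', 'Y', 'VAN DER BERG', '12 OAK ST';

# split breaks on every whitespace character, ignoring field boundaries.
my @by_split = split /\s/, $line;

# unpack uses the declared widths and strips the padding ('A').
my @by_unpack = unpack 'A1 A15 A30', $line;

printf "split:  %d pieces, first is '%s'\n", scalar @by_split, $by_split[0];
printf "unpack: %d fields, second is '%s'\n", scalar @by_unpack, $by_unpack[1];
```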

You could do something like:

while ($line = <F>){
   if ($line =~ /(.{1}) (.{15}) ........ /){
     $nom_stat = $1;
     $lname = $2;
     ...
   }
}

I think it's faster than your substr approach. I'm not sure it's the fastest solution, but it might well be.

markijbema
  • This looks cryptic to me; I'm not used to that syntax. What is this attempt doing, in plain English? – CheeseConQueso Mar 03 '11 at 04:18
  • It's a regex, dot is any one character, and the number in braces is an occurrence count. In other words: 1 character, space, 15 characters [etc]. I still wouldn't do it this way though - use unpack(). – RET Mar 03 '11 at 07:15
  • Wow, I did not expect it to be as slow as Ian C showed. I'm rather surprised really, I thought it would've at least been faster than substr... – markijbema Mar 03 '11 at 09:47
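For readers unfamiliar with the syntax RET describes, a minimal sketch on a made-up three-field line follows. The field widths are the first three from the question; the sample values are invented, and the pattern here has no literal spaces between groups, so it assumes the fields are directly adjacent. The pattern is precompiled with qr//, and each `(.{N})` group captures exactly N characters, padding included.

```perl
use strict;
use warnings;

# Compile the pattern once; each (.{N}) captures exactly N characters.
my $re = qr/^(.{1})(.{15})(.{15})/;

# Made-up sample line, space-padded to the field widths.
my $line = pack 'A1 A15 A15', 'Y', 'SMITH', 'JOHN';

# In list context the match returns the captured groups in order.
my ($nom_stat, $lname, $fname) = $line =~ $re
    or die "line is shorter than the expected 31 characters";

# Captures keep their padding, just as substr() would.
printf "nom_stat='%s' lname='%s' fname='%s'\n", $nom_stat, $lname, $fname;
```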