
We've got a mature body of code that loads data from files into a database. There are several file formats; they are all fixed-width fields.

Part of the code uses the Perl unpack() function to read fields from the input data into package variables. Business logic is then able to refer to these fields in a 'human-readable' way.

The file reading code is generated from a format description once, prior to reading a file.

In sketch form, the generated code looks like this:

while ( <> ) {

    # Start of generated code.

    # Here we unpack 2 fields, real code does around 200.
    ( $FIELDS::transaction_date, $FIELDS::customer_id ) = unpack q{A8 A20}, $_;

    # Some fields have leading space removed.
    # Generated code has one line like this per affected field.
    $FIELDS::customer_id =~ s/^\s+//;

    # End of generated code.

    # Then we apply business logic to the data ...
    my @fields;
    if ( $FIELDS::transaction_date eq $today ) {
        push @fields, q{something or other};
    }

    # Write to standard format for bulk load to the database.
    print $fh join( '|', @fields ) . qq{\n} or die;
}
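For reference, the generator might work from a field list along these lines; the structure shown here is hypothetical, our real format descriptions are more involved:

# Hypothetical format description: field name, width, strip-leading-space flag.
my @format = (
    [ transaction_date => 8,  0 ],
    [ customer_id      => 20, 1 ],
);

my $template = join ' ', map { 'A' . $_->[1] } @format;
my $vars     = join ', ', map { "\$FIELDS::$_->[0]" } @format;

my $code = "( $vars ) = unpack q{$template}, \$_;\n";
$code .= "\$FIELDS::$_->[0] =~ s/^\\s+//;\n" for grep { $_->[2] } @format;

# $code is then written out (or eval'd) once, before the read loop runs.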

Profiling the code reveals that around 35% of the time is spent in the unpack and leading-space strip. The remaining time is spent in validating and transforming the data, and writing to the output file.

It appears that there is no single part of the business logic that takes more than 1-2% of the run time.

The question is: can we eke out a bit more speed from the unpacking and space stripping somehow, preferably without refactoring all the code that refers to the FIELDS package variables?

EDIT:

In case it makes a difference:

$ perl -v
This is perl, v5.8.0 built for PA-RISC1.1
  • Would be interested to know if the use of a list of package variables on the left-hand side of the unpack is likely to be optimal. – martin clayton Oct 19 '09 at 21:47

6 Answers


I've actually dealt with this problem over and over again. Unpack is better than substr.

As far as stripping spaces goes, you're pretty much screwed. That regex hack is the "official" way to do it. You might be able to gain some efficiency by refining your unpack statements (assuming no data is longer than 4 digits, why unpack the full 12 digits of the field?), but otherwise, the parsing is just a p.i.t.a.
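For instance, if a 12-character field never carries more than 4 characters of left-justified data, a narrower template can be benchmarked against the full-width one; the record layout here is made up:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up 20-byte record: an 8-char date plus a 12-char field whose
# data is left-justified and never longer than 4 characters.
my $record = '20091019' . sprintf '%-12s', '1234';

cmpthese( -3, {
    full_width => sub { my ($d, $n) = unpack 'A8 A12',   $record },
    refined    => sub { my ($d, $n) = unpack 'A8 A4 x8', $record },
});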

Good luck with your flat data. Fricking legacy junk, how I hates it.

Satanicpuppy

Are you sure that you are processor bound on this task? The computation is simple enough that the whole process may well be I/O bound, in which case optimizing for a faster unpack won't gain you much time.

If you are in fact processor bound, the problem as described seems pretty much parallelizable, but of course the devil is in the details of your business computation.

ttarchala
  • Pretty sure, yes. In the real code the I/O and unpacking are separated, unlike in the potted example above. Parallelism is a good call - we do that already. Would still like to spend less time on chopping up strings though. – martin clayton Oct 19 '09 at 14:38
  • For me the biggest bottleneck is the database itself; if you're dealing with flat data, you may well be dealing with a horrible ancient flat database. I get around the problem by pulling the data out at intervals to a modern database, but for "real time" data it's almost always the database that's causing the slowdowns. – Satanicpuppy Oct 19 '09 at 16:06
  • Have to agree. Overall our process is spending most of the time slurping data into the database. We are able to use parallel processes to speed this up. It struck me that, since such a large fraction of the 'Perl-time' was spent in such a small part of the code, there might be scope for an easy optimisation here. – martin clayton Oct 19 '09 at 21:40

Yes. Extracting using substr is likely to be the fastest way to do it. That is:

$FIELDS::transaction_date = substr $_, 0, 8;
$FIELDS::customer_id      = substr $_, 8, 20;

is likely to be faster. Now, if I were handwriting this code, I would not give up unpack, but if you are generating the code, you might as well give it a shot and measure.

See also the answers to Is Perl’s unpack() ever faster than substr()?

As for stripping leading spaces, s/^\s+// is likely to be the fastest method.

Update: It is hard to say anything definite without being able to run benchmarks. However, how about:

my $x = substr $_, 0, 8;

for fields that do not need any trimming, and

my ($y) = substr($_, 8, 20) =~ /\A\s*(.+?)\s*\z/;

for those that do?
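Either way, measure; a rough Benchmark sketch along these lines (a made-up two-field record, where the real records have around 200 fields) would settle it:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = '20091019' . sprintf '%-20s', 'ACME CORP';

cmpthese( -3, {
    one_unpack  => sub {
        my ($date, $cust) = unpack 'A8 A20', $line;
    },
    substr_each => sub {
        my $date = substr $line, 0, 8;
        my $cust = substr $line, 8, 20;
        $cust =~ s/\s+\z//;    # unpack 'A' does this for free
    },
});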

Sinan Ünür
  • I'll experiment and report back. – martin clayton Oct 19 '09 at 14:14
  • One thing to note is that unpack 'A' strips trailing spaces 'for free'. – martin clayton Oct 19 '09 at 15:19
  • Looks promising so far. Basic benchmarking shows a series of per-field substrs and strip trailing regexes is about 50% faster for us than using one unpack. Tests are passing, but my screen is full of Perlish warnings, so I'm not quite there yet. – martin clayton Oct 19 '09 at 15:30
  • @martin clayton: What kind of warnings? – Sinan Ünür Oct 19 '09 at 15:35
  • Nothing a judicious defaulting || couldn't take care of. We have some data that doesn't fill the entire record. I'll experiment with the different trailing space strips, because with none the substr version is much quicker than unpack. – martin clayton Oct 19 '09 at 15:54
  • .+ is greedy, so you'll leave one space at the end of your string. You want .+? there. – runrig Oct 22 '09 at 15:48
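A throwaway snippet illustrating the 'A' versus 'a' point from the comments above:

# 'A' strips trailing whitespace on extraction; 'a' keeps it verbatim.
my ($stripped) = unpack 'A8', 'AB      ';   # 'AB'
my ($raw)      = unpack 'a8', 'AB      ';   # 'AB      '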

This could also be something for XS, so that you use a C function to transform the data; I could imagine that being a lot faster than anything else, since you can control exactly when data is really copied. The build process will get more complicated, though, as it adds a dependency on a C compiler and extra integration steps.
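A minimal sketch of the idea using Inline::C, which builds on the same XS machinery; the extract_field function and record layout are made up, and bounds checking is omitted for brevity:

use strict;
use warnings;
use Inline C => <<'END_C';
/* Copy len bytes starting at off out of the record, dropping
   trailing spaces; no bounds checking, for brevity. */
SV* extract_field(char* s, int off, int len) {
    char* p = s + off;
    while (len > 0 && p[len - 1] == ' ')
        len--;
    return newSVpvn(p, len);
}
END_C

my $record = '20091019' . sprintf '%-20s', 'ACME CORP';
my $customer = extract_field($record, 8, 20);   # 'ACME CORP'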

weismat
  • Thanks. Probably for my project the gain would not justify the extra complexity. As remarked by Satanicpuppy, there are other parts of the process that are more expensive than this. Could work for someone else with a similar problem though. – martin clayton Oct 19 '09 at 21:34
  • 3
    If most of the CPU time is spent in the regex engine and a builtin function, then most of the time is spent in C land (i.e. not on walking optrees and handling perl data structures) anyway. And it takes a good programmer to beat the people who wrote the perl VM. unpack is fast! – tsee Oct 19 '09 at 23:27

Simply make it go in parallel. It's trivial, and on any even remotely modern machine it will be faster.
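A minimal fork-based sketch of that idea, assuming the input has already been split into per-worker chunk files; the chunk naming and process_file are placeholders:

use strict;
use warnings;

my @chunks = glob 'input.chunk.*';   # e.g. produced by split(1)

my @pids;
for my $chunk (@chunks) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {
        process_file($chunk);        # the per-line unpack and business logic
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;

sub process_file {
    my ($file) = @_;
    # ... open $file and run the existing read loop over it ...
}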


The benchmark of a substr-based version of our code suggested that it might be around 50% faster than our existing unpack. Comparing the two versions in place in the actual application, the substr version gave us a 16% reduction in run time. This is close to what we'd hoped for, based on the benchmark and the profiling referred to in the question.

This optimisation might be useful for us, but we have a migration to a new OS on the horizon, so we'll wait and see how the two versions perform there before proceeding. We've added a test to keep an eye on the comparative benchmarks.

The idiom we have now is:

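# Trailing-space strip per field; the || '' defaults guard short records.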
$FIELDS::transaction_date = substr( $_, 0, 8 ) || '';
$FIELDS::transaction_date =~ s/\s+\z//;
$FIELDS::customer_id = substr( $_, 8, 20 ) || '';
$FIELDS::customer_id =~ s/\s+\z//;

Followed by selective leading space removal as before.

Thanks for all the answers so far. I'll accept Sinan's because it worked for us, despite appearing to be 'just plain wrong'.

martin clayton