2

As a PHP programmer new to Perl working through 'Programming Perl', I have come across the following regex:

/^(.*?): (.*)$/;

This regex is intended to parse an email header and insert it into a hash. The email header is contained in a separate .txt file and is in the following format:

From: person@site.com
To: email@site.com
Date: Mon, 1st Jan 2000 09:00:00 -1000
Subject: Subject here

The entire code I am using to work with this example regex is as follows:

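use warnings;
use strict;

my %fields = ();

# Three-argument open with a lexical filehandle; $! reports why it failed.
open(my $fh, '<', 'header.txt') or die("Could not open header.txt: $!");

while (<$fh>)
{
    # Store the captures only when the line matched; otherwise
    # $1 and $2 would still hold values from the previous match.
    next unless /^(.*?): (.*)$/;
    $fields{$1} = $2;
}

close($fh);

# Iterate over the keys so each field prints as "name: value".
foreach my $name (keys %fields)
{
    print "$name: $fields{$name}\n";
}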

Now, on to my question. I am unsure why the first subpattern has been modified to use a minimal quantifier. It is perhaps a small point to get hung up on, but I cannot see why it has been done.

Thanks for any replies.

pb149
  • 1
    Please note as a side issue that this mail processing seen here does not handle mail header continuation lines. The following is a legal and common header line: "Subject: This is a\n\tmultiline subject line\n". Also please accept your favorite answer. – Seth Robertson May 19 '11 at 20:03
  • 1
    Note that a minimal quantifier can be replaced by an ordinary (greedy) quantifier applied to a suitably restricted character class. In this case, consider /^([^:]*): (.*)$/, where the first group captures as many non-colon characters as possible. – Narveson May 19 '11 at 20:34
  • 1
    Just a side note: Programming Perl, despite being a classic, is showing its age and does not reflect current thinking on best practices in style. Take a look at Modern Perl, Effective Perl Programming or Perl Best Practices to get this info. Newer versions of the Llama (Learning Perl) also feature these stylistic differences. PS: don't waste any time learning about pseudo-hashes; they've been removed from the language. – daotoad May 20 '11 at 05:00
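
The two technical comments above (continuation lines and the negated character class) can be sketched together. This is only an illustration under stated assumptions: parse_headers is a hypothetical helper, not something from Programming Perl, and it treats any line that starts with a space or tab as a continuation of the previous field.

```perl
use strict;
use warnings;

# Hypothetical helper: turn header lines into a hash of field => value.
sub parse_headers {
    my @lines = @_;
    my (%fields, $last);

    for my $line (@lines) {
        chomp $line;

        if (defined $last && $line =~ /^[ \t]/) {
            # Continuation line: append it to the previous field's value.
            $fields{$last} .= $line;
        }
        elsif ($line =~ /^([^:]*): (.*)$/) {
            # Greedy quantifier on a class that excludes ':' stops at the
            # first colon, like the minimal quantifier in /^(.*?): (.*)$/.
            $last = $1;
            $fields{$last} = $2;
        }
    }

    return %fields;
}

my %h = parse_headers(
    'From: person@site.com',
    "Subject: This is a\n",
    "\tmultiline subject line\n",
);

print "$_ => $h{$_}\n" for sort keys %h;
```

The continuation text is appended with its leading whitespace intact, so the Subject value here ends up containing a tab between the two parts.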

6 Answers

7

If it hadn't been, the line would be split in the wrong place whenever the value contains :<space>.

Imagine:

Subject: Urgent: Need a regex

Without the minimal match, $1 would be Subject: Urgent and $2 would be Need a regex.
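
A minimal sketch to check this, running both forms of the regex against the example line (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $line = 'Subject: Urgent: Need a regex';

# Minimal first group: stops at the first ": ".
my ($lazy_name, $lazy_value) = $line =~ /^(.*?): (.*)$/;

# Greedy first group: backtracks only as far as the last ": ".
my ($greedy_name, $greedy_value) = $line =~ /^(.*): (.*)$/;

print "minimal: [$lazy_name] [$lazy_value]\n";       # [Subject] [Urgent: Need a regex]
print "greedy:  [$greedy_name] [$greedy_value]\n";   # [Subject: Urgent] [Need a regex]
```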

Mat
6

Consider what happens if the subject is Subject: RE: reply to something.

A minimal quantifier will stop after Subject, but the greedy quantifier will match up to RE.

dsolimano
4

Because otherwise the first group would match everything up to the last ': ' (a colon followed by a space). For example, without the minimal quantifier this string:

Test: My: Weird: String

captures "Test: My: Weird" in the first group. With the minimal quantifier, the first group captures only "Test".

Andrey Adamovich
4

The minimal quantifier is there because the first group does not need to read any further than the first colon, and in fact should not. I'm not sure which characters can appear in these keywords, but I am pretty sure . is too broad, and that is the problem. If a field value contains a colon followed by a space, a greedy first group would gobble up everything before it, for example:

Subject: Counter Strike: Source

If the first subpattern were greedy, it would grab Subject: Counter Strike, and not just Subject.

TLP
0

Without a minimal quantifier, wouldn't the first capture for the Date line be "Date: Mon, 1st Jan 2000 09:00:" instead of "Date:"?

Shea Levy
  • 1
    Not really: the original regex also requires a space after the ':', so in this case the first group matches the same text with or without the minimal quantifier. – Andrey Adamovich May 19 '11 at 17:45
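
The comment's point can be checked with a short sketch against the Date line from the question. The only colon followed by a space is the one after Date, so both forms capture the same first group (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $line = 'Date: Mon, 1st Jan 2000 09:00:00 -1000';

# In list context the match returns ($1, $2); keep only the first capture.
my ($minimal) = $line =~ /^(.*?): (.*)$/;
my ($greedy)  = $line =~ /^(.*): (.*)$/;

print "minimal: [$minimal]\n";   # [Date]
print "greedy:  [$greedy]\n";    # [Date]
```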
0

Without that minimal quantifier, the value for $1 obtained from the "Date:" line would actually be "Date: Mon, 1st Jan 2000 09:00" due to Perl regex being greedy by default.

Brian Showalter
  • 1
    Not really: the original regex also requires a space after the ':', so in this case the first group matches the same text with or without the minimal quantifier. – Andrey Adamovich May 19 '11 at 17:46