Values from IF statement regex match (Perl)

Question

I'm currently extracting values from a table within a file via REGEX line matches against the table rows.

foreach my $line (split("\n", $file)) {
    if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
        my ($val1, $val2, $val3) = ($1, $2, $3);

        # $val's used here
    }
}

I purposely assign vals for clarity in the code. Some of my table rows contain 10+ vals (aka columns) - is there a more efficient method of assigning the vals instead of doing ... = ($1, $2, ..., $n)?

http://stackoverflow.com/questions/2304577/how-can-i-store-regex-captures-in-an-array-in-perl ? — Scroog1, Apr 18 '12 at 14:47
I always liked http://stackoverflow.com/questions/874915/perl-extracting-data-from-text-using-regex where they use split - your regexp seems to be a candidate. — Konerak, Apr 18 '12 at 14:47

score 9 · Accepted Answer · answered Apr 18 '12 at 14:48

A match in list context yields a list of the capture groups. If it fails, it returns an empty list, which is false. You can therefore assign the captures and test the match in a single statement:

if( my ( $val1, $val2, $val3 ) = $line =~ m/^(\S+)\s*(\S+)\s*(\S+)$/ ) {
   ...
}
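For rows with ten or more columns, the same list-context match can fill an array rather than a long list of scalars. A minimal sketch, assuming a whitespace-delimited row; the sample `$line` and the `@vals` name are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample row; real rows would come from the question's file.
my $line = "alpha beta gamma";

my @vals;
if ( @vals = $line =~ /^(\S+)\s+(\S+)\s+(\S+)$/ ) {
    # @vals holds the captures in order: $vals[0] is $1, $vals[1] is $2, ...
    print scalar(@vals), " columns: @vals\n";
}
```

The array is declared outside the `if` here only so it stays visible afterwards; `if ( my @vals = ... )` works the same way when the values are needed only inside the block.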

However, a number of red flags are apparent in this code. That regexp capture looks very similar to a split:

if( my ( $val1, $val2, $val3 ) = split ' ', $line ) {
   ...
}

Secondly, why split $file on linefeeds? If you are reading the contents of a file, it is far nicer to read a single line at a time:

while( my $line = <$fh> ) {
   ...
}

LeoNerd

Instead of `split ' '` I tend to use `split /\s+/` – Leonardo Herrera Apr 18 '12 at 14:56
@LeonardoHerrera Why? All it does is preserve a leading null field if there is leading white space. – TLP Apr 18 '12 at 15:02
Be careful when swapping the regexp for the split: they do not mean the same thing. Consider what each would produce if `$line = 'abc def';` – Ven'Tatsu Apr 18 '12 at 15:18
Thanks for your input. Your first example does simplify the code, but readability is debatable. The regex in my example is simplified. Some of my tables don't have consistent delimiters. The initial file is slurped into a string and portions of it are removed to prevent regex match conflicts. That's why the file ends up being read as a string in my example. – kaspnord Apr 18 '12 at 15:39
@TLP - you have a good point. I forgot that `split` treats the space character differently than other characters. – Leonardo Herrera Apr 18 '12 at 20:59
@LeonardoHerrera Well, like many other Perl features, `" "` is magic, meaning it has hidden functionality intended for user convenience. I believe this particular feature is intended to emulate a certain awk feature. – TLP Apr 18 '12 at 21:08
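The difference discussed in these comments can be seen with a line that starts with whitespace. A small sketch, using a made-up sample string:

```perl
use strict;
use warnings;

my $line = "  abc def";    # note the leading whitespace

my @magic = split ' ',   $line;    # awk-style: leading whitespace is skipped
my @regex = split /\s+/, $line;    # a leading empty field is kept

print scalar(@magic), " fields from split ' '\n";
print scalar(@regex), " fields from split on whitespace regex\n";
```

With this input, `@magic` is `("abc", "def")` while `@regex` starts with an empty string before `"abc"`.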

score 2 · Answer 2 · answered Apr 18 '12 at 15:16

I assume that this is not your actual code, because if so, it will not work:

foreach my $line (split("\n", $file)) {
    if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
        my ($val1, $val2, $val3) = ($1, $2, $3);
    }
# all the $valX variables are now out of scope
}

You should also be aware that \s* will also match the empty string, and may cause subtle errors. For example:

"a bug" =~ /^(\S+)\s*(\S+)\s*(\S+)$/;
# the captures are now: $1 = "a"; $2 = "bu"; $3 = "g"

Even though \S+ is greedy, the anchors ^ ... $ force the regex to fit, so the \s* separators match empty strings and the words get split.
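One way to avoid this, sketched here under the assumption that columns are always separated by at least one whitespace character, is to require \s+ between the groups so that a two-word line no longer matches. The names `$loose` and `$strict` are only for illustration:

```perl
use strict;
use warnings;

my $loose  = qr/^(\S+)\s*(\S+)\s*(\S+)$/;    # pattern from the question
my $strict = qr/^(\S+)\s+(\S+)\s+(\S+)$/;    # requires real separators

my @loose_caps  = "a bug" =~ $loose;     # ("a", "bu", "g")
my @strict_caps = "a bug" =~ $strict;    # empty list: only two words

print "loose:  @loose_caps\n";
print "strict matched ", scalar(@strict_caps), " fields\n";
```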

If your intention is to capture all the words that are separated by whitespace, using split is your best option, as others have already mentioned.

open my $fh, "<", "file.txt" or die $!;
my @stored;
while (<$fh>) {
    my @vals = split;
    push(@stored, \@vals) if @vals; # skip blank lines
}

This will store any captured values into a two-dimensional array. Using the file handle directly and reading line-by-line is the preferred method, unless for some reason you actually need to have the entire file in memory.
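If the data is already in a string, as in the question, the same loop can still read it line by line through an in-memory filehandle. A sketch under that assumption, with made-up contents in `$text`:

```perl
use strict;
use warnings;

# Hypothetical table contents, standing in for the slurped file.
my $text = "a b c\n\nd e f g\n";

# Opening a reference to a scalar gives an in-memory filehandle.
open my $fh, "<", \$text or die $!;

my @stored;
while ( my $line = <$fh> ) {
    my @vals = split ' ', $line;
    push @stored, \@vals if @vals;    # skip blank lines
}
close $fh;

# Elements are addressed as $stored[ROW][COLUMN], both counted from zero.
print "row 1, column 3: $stored[1][3]\n";
```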

Thanks for your input. I have updated the question to clarify the scope of the $val variables. Does the `split` in your example handle a varying number of white spaces? Unfortunately, I can't explicitly name the vals using your example. See my comment on LeoNerd's post regarding file handling. — kaspnord, Apr 18 '12 at 15:49
@kaspnord Yes, splitting into an array will hold any number of matches. Although if the number of whitespaces is a concern, e.g. if "a\t\tc" is supposed to be `$val1 = "a"; $val3 = "c"` (skipping `$val2`), then no. But then you are probably better off using a CSV module. If you do not know the number of variables needed, you use an array. If you feel it is necessary, you can easily count your array elements and assign them to named variables later. Slurping the file is probably not necessary (it usually isn't), and will decrease performance. But that's another question. — TLP, Apr 18 '12 at 16:11
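To keep named columns while still splitting into an array, one option is a hash slice. This is a sketch only; the column names in `@names` and the sample row are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical column names and sample row, for illustration only.
my @names = qw(host status bytes);
my $line  = "example.org   200\t512";

my @vals = split ' ', $line;

my %row;
if ( @vals == @names ) {
    @row{@names} = @vals;    # hash slice: pair each name with its value
    print "$row{host} returned $row{status}\n";
}
```

Checking the element count first means a malformed row is skipped instead of silently leaving some names undefined.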

score 1 · Answer 3 · answered Apr 18 '12 at 14:52

Looks like you are just using a table with a space delimiter. You can use the split function:

my @valuearray = split(" ", $line);

And then address the elements as:

$valuearray[0], $valuearray[1], etc.

byrondrossos

Thanks for your input. The example I provided is simplified - the delimiters in my report are actually not consistent. – kaspnord Apr 18 '12 at 15:44
@kaspnord split supports full regular expressions. You can use any delimiter, even different delimiters in the same table. – byrondrossos Apr 18 '12 at 18:13
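As a rough illustration of that last point, a single split pattern can cover several delimiters. The sample row and the choice of delimiters below are assumptions, not taken from the actual report:

```perl
use strict;
use warnings;

# Hypothetical row mixing commas, pipes, and runs of whitespace.
my $line = "alpha, beta|gamma   delta";

# Split on a comma or pipe with optional surrounding whitespace,
# or on a run of whitespace alone.
my @vals = split /\s*[,|]\s*|\s+/, $line;

print join("/", @vals), "\n";
```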
