Regular Expression Perl period quotation

Question

After many failed attempts with a regex in Perl, I have the following problem. I need to parse this line:

  232310..1.3      3213   2.4  "$250 For My jacket" (2012)

I am trying to retrieve it via:

if ( $line=~m/^\s+(\d+|\.+)\s+(\d+)\s+(\d+|\.+)\s+(\^"&(\w*|\s*|\D*)"$)\s*\((\d+)\s*/){
        $ID=$1;
        $Amount=$2;
        $Size=$3;
        $Item=$4;
        $Year=$5;
}

It does not work.

score 6 · Accepted Answer · answered Apr 16 '12 at 08:44


(\d+|\.+) means either one or more digits or one or more periods, but never a mix of the two. What you want is ([\d.]+), which means one or more characters, each of which is a digit or a period.
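To make the difference concrete, here is a small sketch that runs both patterns against the first field of the sample line. The variable names ($field, $alt, $class, $full_match) are only for illustration and are not from the question.

```perl
use strict;
use warnings;

my $field = '232310..1.3';

# Alternation: either a run of digits OR a run of periods, so it stops
# at the first character that does not belong to the chosen branch.
my ($alt) = $field =~ /^(\d+|\.+)/;

# Character class: any mix of digits and periods.
my ($class) = $field =~ /^([\d.]+)/;

print "alternation: $alt\n";    # alternation: 232310
print "class:       $class\n";  # class:       232310..1.3

# In the original pattern the group must be followed by whitespace,
# which is why the whole match fails: after "232310" comes a period.
my $full_match = ("  $field   3213" =~ /^\s+(\d+|\.+)\s+/) ? 1 : 0;
print "original prefix matches: $full_match\n";  # original prefix matches: 0
```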

A similar problem exists for capturing the size and the item. You are also using the start anchor (^) and the end anchor ($) incorrectly: they appear in the middle of the pattern, where they cannot match.

You can try:

^\s+([\d.]+)\s+(\d+)\s+([\d.]+)\s+"([^"]+)"\s*\((\d+)\s*

See it
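As a quick check, the pattern above can be dropped into the original if-statement. This is only a sketch: the sample line is hard-coded in a single-quoted string (so that $250 is not interpolated), and the lowercase variable names are my own choice.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep "$250" literal.
my $line = '  232310..1.3      3213   2.4  "$250 For My jacket" (2012)';

my ($id, $amount, $size, $item, $year);

if ($line =~ /^\s+([\d.]+)\s+(\d+)\s+([\d.]+)\s+"([^"]+)"\s*\((\d+)\s*/) {
    ($id, $amount, $size, $item, $year) = ($1, $2, $3, $4, $5);
    print join('|', $id, $amount, $size, $item, $year), "\n";
    # prints: 232310..1.3|3213|2.4|$250 For My jacket|2012
}
else {
    warn "Line does not match\n";
}
```

Note that the item is captured without its surrounding quotes and the year without its parentheses, because both sit outside the capture groups.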

answered Apr 16 '12 at 08:44

codaddict


Thanks it worked perfectly!...I check out the link too! Sounds like a good tool. Seems like ruby is using the same syntax for regex as perl too!:) – Majic Johnson Apr 16 '12 at 09:25
I didn't quite understand "([^"]+)". Can you please explain? – Majic Johnson Apr 16 '12 at 09:42
«"», one or more characters other than «"», then «"». The part within the quotes is captured. – ikegami Apr 16 '12 at 10:11
The link is wonderfully useful for everyone who wants to test regular expressions on the fly! Thanks again. I keep using the link "See it" above! – Majic Johnson Apr 19 '12 at 21:19
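For anyone else puzzled by the "([^"]+)" fragment discussed in the comments above, here is a minimal sketch of the negated character class on its own. The test strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'jacket "$250 For My jacket" (2012)';

# A literal quote, then one or more characters that are NOT a quote
# (captured), then the closing quote.
my ($quoted) = $text =~ /"([^"]+)"/;
print "$quoted\n";  # $250 For My jacket

# Because of the +, an empty pair of quotes does not match at all;
# "([^"]*)" would be needed to allow empty values.
my $empty_matches = ('""' =~ /"([^"]+)"/) ? 'yes' : 'no';
print "empty quotes match: $empty_matches\n";  # empty quotes match: no
```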

score 2 · Answer 2 · answered Apr 17 '12 at 12:35


codaddict's solution is fine if all of your 4th column entries are quoted. A different approach is to use a CSV parser (which you'd probably need to install from CPAN first), for instance:

#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV_XS;

# Reader: fields are separated by single spaces
my $csvr = Text::CSV_XS->new({
  sep_char => ' ',
  eol => $/
});

# Writer: ordinary comma-separated output
my $csvw = Text::CSV_XS->new({
  sep_char => ',',
  eol => $/
});

$csvw->print( *STDOUT, [ qw(ID Amount Size Item Year) ]);

while (my $row = $csvr->getline(*ARGV))
{
  $csvw->print( *STDOUT, [ grep { /./ } @$row ] );
}

When given the input

232310..1.3      3213   2.4  "$250 For My jacket" (2012)

this will produce:

232310..1.3,3213,2.4,"$250 For My jacket",(2012)

A further step is to use DBD::CSV, which allows you to perform SQL queries on your input file.

answered Apr 17 '12 at 12:35

reinierpost


This is really nice I can see your point but can't we just print $id,$Amount,$Size etc. Or is there something I am missing about the CSV files? – Majic Johnson Apr 19 '12 at 02:46
good comment about using a CSV parser - Text::CSV is possibly an easier alternative depending on the environment - it will use XS if installed, pure-perl if not. – plusplus Apr 19 '12 at 08:25
@MajicJohnson - `$id` etc don't exist in this code. For readability you could extract them in the loop from the `$row` arrayref as follows: `my ($ID, $Amount, $Size, $Item, $Year) = @{$row};` – plusplus Apr 19 '12 at 08:28
@Majic Johnson: the idea is that you'd be more robust: in general you'd be better prepared against missing something in the CSV files, and it would be easier to correct for anything you'd missed by changing the parser's configuration. If the format is really simple (e.g. quotes *always* appear in the 4th column and *never* anywhere else, and *never* inside the values in the 4th column) and you can be sure about this, then regex-based matching will do just fine. – reinierpost Apr 19 '12 at 09:52
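To illustrate plusplus's suggestion without requiring the CPAN module, here is a sketch in which a hard-coded array reference stands in for one $row as it would look after the grep in the loop above. That stand-in is an assumption for the demo, not actual parser output. It also shows one difference from the regex approach: the year still carries its parentheses, so they are stripped here with tr.

```perl
use strict;
use warnings;

# Stand-in for one filtered row from the loop above (assumed values,
# taken from the sample output in the answer).
my $row = [ '232310..1.3', '3213', '2.4', '$250 For My jacket', '(2012)' ];

# Unpack the arrayref into named variables for readability.
my ($ID, $Amount, $Size, $Item, $Year) = @{$row};

# The parser keeps "(2012)" as-is; delete the parentheses.
$Year =~ tr/()//d;

print "$ID, $Amount, $Size, $Item, $Year\n";
# prints: 232310..1.3, 3213, 2.4, $250 For My jacket, 2012
```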

score 1 · Answer 3 · answered Apr 17 '12 at 12:01

Same fix as codaddict's, but showing how you can make regexes more readable - the 'x' option is very useful for longer regexes and multiple capture variables.

(I would have posted this as a comment, but for the limited formatting options)

my ( $id, $amount, $size, $item, $year ) = $line =~ m{
    ^
    \s+
    ([\d.]+)        # field 1, e.g. 232310..1.3
    \s+
    (\d+)           # field 2, e.g. 3213
    \s+
    ([\d.]+)        # field 3, e.g. 2.4
    \s+
    "([^"]+)"       # field 4, e.g. "$250 For My jacket"
    \s*
    \((\d+)\)       # field 5, e.g. (2012)
    \s*
}x or die "Line does not match!";  # always check that a regex actually succeeded!
