2

After many failed attempts with a regex in Perl, I have the following problem. I need to parse this line:

  232310..1.3      3213   2.4  "$250 For My jacket" (2012)

I am trying to retrieve it via:

if ( $line=~m/^\s+(\d+|\.+)\s+(\d+)\s+(\d+|\.+)\s+(\^"&(\w*|\s*|\D*)"$)\s*\((\d+)\s*/){
    $ID=$1;
    $Amount=$2;
    $Size=$3;
    $Item=$4;
    $Year=$5;
}

It does not work.

codaddict
Majic Johnson

3 Answers

6

(\d+|\.+) matches either one or more digits or one or more periods, but never a mix of the two. What you want is ([\d.]+), which matches one or more characters, each of which is a digit or a period.

A similar problem exists in the patterns that capture the size and the item. Also, you're using the start anchor (^) and the end anchor ($) incorrectly.

You can try:

^\s+([\d.]+)\s+(\d+)\s+([\d.]+)\s+"([^"]+)"\s*\((\d+)\s*
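To make the difference between the alternation and the character class concrete, here is a minimal sketch that tests both against the ID from the sample line. The variable names ($field, $alt_ok, $class_ok) are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# The ID field from the sample line: digits and periods mixed together.
my $field = '232310..1.3';

# Alternation: the whole field must be all digits OR all periods.
my $alt_ok = $field =~ /^(\d+|\.+)$/ ? 1 : 0;

# Character class: any mix of digits and periods is accepted.
my $class_ok = $field =~ /^([\d.]+)$/ ? 1 : 0;

print "alternation: $alt_ok, character class: $class_ok\n";
# prints "alternation: 0, character class: 1"
```

Only the character class accepts the mixed field, which is why the original pattern fails on the first column.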


codaddict
2

codaddict's solution is fine if all of your 4th-column entries are quoted. A different approach is to use a CSV parser (which you'd probably need to install from CPAN first), for instance:

#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV_XS;

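# Reader: fields in the input are separated by spaces.
my $csvr = Text::CSV_XS->new({
  sep_char => ' ',
  eol => $/
});

# Writer: emit ordinary comma-separated output.
my $csvw = Text::CSV_XS->new({
  sep_char => ',',
  eol => $/
});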

$csvw->print( *STDOUT, [ qw(ID Amount Size Item Year) ]);

while (my $row = $csvr->getline(*ARGV))
{
  # Runs of spaces produce empty fields; keep only the non-empty ones.
  $csvw->print( *STDOUT, [ grep { /./ } @$row ] );
}

When given the input

232310..1.3      3213   2.4  "$250 For My jacket" (2012)

this will produce, after the header line:

232310..1.3,3213,2.4,"$250 For My jacket",(2012)

A further step is to use DBD::CSV, which allows you to perform SQL queries on your input file.

reinierpost
  • This is really nice, and I can see your point, but can't we just print $id, $Amount, $Size, etc.? Or is there something I am missing about the CSV files? – Majic Johnson Apr 19 '12 at 02:46
  • Good comment about using a CSV parser. Text::CSV is possibly an easier alternative depending on the environment: it will use XS if installed, pure-perl if not. – plusplus Apr 19 '12 at 08:25
  • @MajicJohnson - `$id` etc don't exist in this code. For readability you could extract them in the loop from the `$row` arrayref as follows: `my ($ID, $Amount, $Size, $Item, $Year) = @{$row};` – plusplus Apr 19 '12 at 08:28
  • @Majic Johnson: the idea is that you'd be more robust: in general you'd be better prepared against missing something in the CSV files, and it would be easier to correct for anything you'd missed by changing the parser's configuration. If the format is really simple (e.g. quotes *always* appear in the 4th column and *never* anywhere else, and *never* inside the values in the 4th column) and you can be sure about this, then regex-based matching will do just fine. – reinierpost Apr 19 '12 at 09:52
1

Same fix as codaddict's, but showing how you can make regexes more readable - the 'x' option is very useful for longer regexes and multiple capture variables.

(I would have posted this as a comment, but for the limited formatting options)

my ( $id, $amount, $size, $item, $year ) = $line =~ m{
    ^
    \s+
    ([\d.]+)        # field 1, e.g. 232310..1.3
    \s+
    (\d+)           # field 2, e.g. 3213
    \s+
    ([\d.]+)        # field 3, e.g. 2.4
    \s+
    "([^"]+)"       # field 4, e.g. "$250 For My jacket"
    \s*
    \((\d+)\)       # field 5, e.g. (2012)
    \s*
}x or die "Line does not match!";  # always check that a regex actually succeeded!
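If dying on the first bad line is too strict, the same pattern can be wrapped in a small helper that returns nothing on failure, so the caller decides what to do. This is only a sketch: parse_line and the hash keys are hypothetical names, and the leading \s+ is relaxed to \s* on the assumption that some lines may lack the indentation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hashref of the five fields,
# or undef (in scalar context) when the line does not match.
sub parse_line {
    my ($line) = @_;
    my ( $id, $amount, $size, $item, $year ) = $line =~ m{
        ^ \s*
        ([\d.]+)  \s+    # field 1: digits and periods
        (\d+)     \s+    # field 2: digits only
        ([\d.]+)  \s+    # field 3: digits and periods
        "([^"]+)" \s*    # field 4: captured without the quotes
        \((\d+)\)        # field 5: captured without the parentheses
    }x or return;
    return {
        id   => $id,   amount => $amount, size => $size,
        item => $item, year   => $year,
    };
}

# Single quotes keep Perl from interpolating "$250".
my $sample = '  232310..1.3      3213   2.4  "$250 For My jacket" (2012)';
my $rec = parse_line($sample);
print "$rec->{item} ($rec->{year})\n" if $rec;
```

In a loop over a file, a line for which parse_line returns undef can then be logged and skipped instead of aborting the whole run.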
plusplus