4

I have a file that I need to parse in the following format. (All delimiters are spaces):

field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.

I am familiar with how to parse a single-line fixed-width file, but I am stumped on how to handle values that span multiple lines.

NeonD

4 Answers

8
#!/usr/bin/env perl

use strict; use warnings;

my (%fields, $current_field);

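while (my $line = <DATA>) {
    next unless $line =~ /\S/;    # skip blank lines

    # Continuation line: leading whitespace, then more of the previous value.
    if ($line =~ /^ \s+ ( \S .*? ) \s* $/x) {
        if (defined $current_field) {
            $fields{$current_field} .= " $1";    # join lines with a single space
        }
    }
    # Field line: non-greedy name up to the first colon, then the value.
    elsif ($line =~ /^ (.+?) : \s+ ( \S .*? ) \s* $/x) {
        $current_field = $1;
        $fields{$current_field} = $2;
    }
}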

use Data::Dumper;
print Dumper \%fields;

__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.
Sinan Ünür
  • Thank you! I changed the first instance of `.+` to `.+?` to make the pattern matching ungreedy. This helped me with values that contained a ":" character. – NeonD Dec 14 '11 at 21:44
  • What if a multi-line value contains a colon? – TLP Dec 15 '11 at 00:11
  • @TLP See my fix. Also, if the file format specifies that values may only begin after a certain column, that would make the job easier. – Sinan Ünür Dec 15 '11 at 12:41
  • @SinanÜnür Looks good. I still think `unpack` might be the better tool, though. He did say it was fixed width, so columns should be aligned. – TLP Dec 15 '11 at 12:51
  • @TLP Thanks. Fields might be fixed width in a file but the starting position of the value field might differ across files. In that case, one can auto-detect where the value field begins etc but I didn't think it was worth the effort for now. – Sinan Ünür Dec 15 '11 at 13:00
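
The auto-detection mentioned in the last comment could look roughly like this. It is only a sketch: it assumes every value in a file is aligned to the same column, and it takes that column from the first "name: value" line it sees. The helper names detect_value_column and parse_lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the 0-based column where values start by
# measuring the first "name:   value" line.
sub detect_value_column {
    my (@lines) = @_;
    for my $line (@lines) {
        return length $1 if $line =~ /^(\S.*?:\s+)\S/;
    }
    return;    # no field line found
}

# Hypothetical helper: build [name, value] pairs using the detected
# column instead of a hard-coded width.
sub parse_lines {
    my (@lines) = @_;
    my $col = detect_value_column(@lines);
    return [] unless defined $col;

    my @records;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;                 # skip blank lines
        my $name  = substr $line, 0, $col;
        my $value = length($line) > $col ? substr($line, $col) : '';
        $value =~ s/\s+$//;                        # trim trailing whitespace
        if ($name =~ /\S/) {                       # a name is present: new record
            $name =~ s/:\s*$//;
            push @records, [ $name, $value ];
        }
        elsif (@records) {                         # blank name column: continuation
            $records[-1][1] .= " $value";
        }
    }
    return \@records;
}

my $records = parse_lines(
    "name:    first value\n",
    "         continued: with a colon\n",
    "other:   second value\n",
);
print "$_->[0] => $_->[1]\n" for @$records;
```

Because continuation lines are recognised by an empty name column rather than by the absence of a colon, a colon inside a multi-line value does not start a new field.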
4

Fixed-width says unpack to me. It is possible to parse with regexes and split, but unpack should be a safer choice, as it is the Right Tool for fixed width data.

I set the width of the first field to 12 and the gap after it to 13, which works for this data; you may need to change that. The template "A12A13A*" means "take 12 characters, then 13, then everything that remains", with trailing whitespace stripped from each piece. In the sample data the colon is the 13th character, so it falls into the second piece, which is discarded. unpack returns a list of these pieces. unpack also uses $_ if no string is supplied, which is what we do here.

Note that if the first field is not fixed width up to the colon, as it appears to be in your sample data, you'll need to merge the fields in the template, e.g. "A25A*", and then strip the colon.
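
A minimal sketch of that merged-template variant, assuming the value always starts at column 26, so that name, colon and padding together fill 25 characters. The width comes from the sample data, and the helper name parse_line is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into (name, text) with a single
# 25-character name column. Adjust the width for your file.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($name, $text) = unpack "A25A*", $line;   # "A" strips trailing whitespace
    $name =~ s/:\s*$//;                          # strip the colon after the name
    return ($name, $text);
}

# A name one character longer than those in the sample data still works,
# because the colon no longer has to sit at a fixed position.
my $sample = "field name 10:" . (" " x 11) . "Some value.\n";
my ($name, $text) = parse_line($sample);
print "[$name] [$text]\n";    # prints "[field name 10] [Some value.]"
```

A continuation line comes back with an empty name, so the same "empty first field" test can still be used to detect multi-line values.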

I chose an array for storage, as I do not know whether your field names are unique. A hash would overwrite fields with the same name. Another benefit of an array is that it preserves the order of the data as it appears in the file. If neither of these matters and quick lookup is more of a priority, use a hash instead.

Code:

use strict;
use warnings;
use Data::Dumper;

my $last_text;
my @array;
while (<DATA>) {
    next unless /\S/;    # skip blank lines
    # unpack the fields; "A" strips trailing whitespace
    my ($field, undef, $text) = unpack "A12A13A*";
    if (length $field) {    # an empty $field means this is a continuation line
        $field =~ s/:$//;                 # strip the colon, if the name column includes it
        $last_text = [ $field, $text ];   # store data in anonymous array
        push @array, $last_text;          # and store that array in @array
    } elsif ($last_text) {  # continuation lines are appended to the previous value
        $last_text->[1] .= " $text";
    }
}

print Dumper \@array;

__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value
                         with a third line

Output:

$VAR1 = [
          [
            'field name 1',
            'Multiple word value.'
          ],
          [
            'field name 2',
            'Multiple word value along with multiple lines.'
          ],
          [
            'field name 3',
            'Another multiple word and multiple line value with a third line'
          ]
        ];
TLP
2

You could do this:

#!/usr/bin/perl

use strict;
use warnings;

my @fields;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";

while (<$fh>) {
    if (/^\s/ && @fields) {
        $fields[-1] .= $_;    # continuation line: append to the previous field
    } else {
        push @fields, $_;     # new field
    }
}

close $fh;

If the line starts with white space, append it to the last element in @fields, otherwise push it onto the end of the array.
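
Each element of @fields built this way still holds the raw text, including the field name, the newlines and the indentation. One way to turn an element into a name and a single-line value is sketched below; the helper name split_record is hypothetical, and it assumes the first colon in a record ends the field name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one joined record such as
# "name:   first line\n        second line\n" into (name, value).
sub split_record {
    my ($record) = @_;
    my ($name, $value) = split /:\s*/, $record, 2;   # split on the first colon only
    $value = '' unless defined $value;
    $value =~ s/\s+/ /g;    # collapse newlines and indentation into single spaces
    $value =~ s/\s+$//;     # drop trailing whitespace
    return ($name, $value);
}

my $record = "field name 2:            Multiple word value along\n"
           . "                         with multiple lines.\n";
my ($name, $value) = split_record($record);
print "$name => $value\n";
# prints "field name 2 => Multiple word value along with multiple lines."
```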

Alternatively, slurp the entire file and split with look-around:

#!/usr/bin/perl

use strict;
use warnings;

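local $/;    # undefine the input record separator to read the whole file at once

open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";

# split after every newline that is not followed by whitespace
my @fields = split /(?<=\n)(?!\s)/, <$fh>;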

close $fh;

It's not a recommended approach, though, since it reads the entire file into memory at once.

flesk
0

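You can change the input record separator so that each read returns a whole record. This assumes every field name starts with the literal text "field name":

open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";

{
    local $/ = "\nfield name";    # a record ends where the next field name begins

    while (my $record = <$fh>) {
        chomp $record;            # removes the trailing "\nfield name"

        # /s lets "." match newlines, so multi-line values are captured whole
        if ($record =~ /(\d+):\s+(.*\S)/s) {
            print "Record $1 is $2\n";
        }
    }
}
close $fh;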
musefan