How can I extract columns from a fixed-width format in Perl?

Question

I'm writing a Perl script to run through and grab various data elements such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I can grab each line of this text file no problem.

I have working regex to grab each of those fields. Once I have the line in a variable, i.e. $line - how can I grab each of those fields and place them into their own variables even though they have different delimiters?

Could you be a bit more specific about what your delimiters are? — Paul Williams, Sep 29 '09 at 20:05
@Paul: That's the problem. This output is being generated by an application (splunk). I can't set the delimiters, and they seem to be just two whitespace characters between each column - but There's whitespace that extends past that. — Greg, Sep 29 '09 at 20:41
@Greg: This really doesn't look like a "delimited" file at all. These are fixed-width columns. FM's solution below is really the cleanest way to parse this. — Clinton Pierce, Sep 29 '09 at 21:02
@Paul - Ok got it. Thanks. I just always refer to anything that's separated by something, delimited. But that makes a lot more sense. — Greg, Sep 30 '09 at 12:35
I'm not sure I agree, that one is simple, this one implies that the fixed-width format is for a database or something more complex. — Evan Carroll, Jul 15 '10 at 04:07

FMc · Answer 1 · 2009-09-30T01:41:37.693

14

This example illustrates how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack if you use upper-case (A10 etc), whitespace will be removed for you. Note: as brian d foy points out, the split approach does not work well for a situation with missing fields (for example, the second line of data), because the field position information will be lost; unpack is the way to go here, unless we are misunderstanding your data.

use strict;
use warnings;

while (my $line = <DATA>){
    chomp $line;
    my @fields_whitespace = split m'\s+', $line;
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

edited Sep 30 '09 at 01:41

answered Sep 29 '09 at 20:31

FMc

41,963
13
79
132

3

Everyone forgets that Perl has pack, but it's really handy and I should use it more myself. I was just editing that chapter for the next edition of Effective Perl Programming. :) – brian d foy Sep 29 '09 at 23:26
1

`split m'\s+'` would highlight better. – Brad Gilbert Sep 29 '09 at 23:56
According to the perldoc - "The string is broken into chunks described by the TEMPLATE." These chunks are inserted into the @fields_fixed array, correct? – Greg Sep 30 '09 at 13:53
while ($line = ) { if ($x < 2) { $x++; } else { chomp $line; @fields_whitespace = split m'\s+', $line; @fields_fixed = unpack('a10 a10 a12 a28', $line); print @fields_fixed, "\n"; $x++; } } That's what I have - I cannot access @fields_fixed outside the block - am I missing some basic ideal of programming that I should remember? I know and understand scope - but am confused in this case. – Greg Sep 30 '09 at 14:19

Evan Carroll · Answer 2 · 2010-07-15T03:59:19.613

Use my module DataExtract::FixedWidth. It is the most full featured, and well tested, for working with Fixed Width columns in perl. If this isn't fast enough you can pass in an unpack_string and eliminate the need for heuristic detection of boundaries.

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

    --alternatively if you want header info--

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I've used this module in the past and the column detection is slick. — Demosthenex, Jul 15 '10 at 08:04

score 0 · Answer 3 · answered Sep 30 '09 at 13:35

I'm unsure of the column names and formatting but you should be able to adjust this recipe to your liking using Text::FixedWidth

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

score -1 · Answer 4 · edited Sep 29 '09 at 20:16

-1

You can split the line. It appears that your delimiter is just whitespace? You can do something on the order of:

@line = split(" ", $line);

This will match all whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

Split can also take a regular expression rather than a string as a delimiter as well.

@line = split(/\s+/, $line);

This might do the same thing.

edited Sep 29 '09 at 20:16

friedo

65,762
16
114
184

answered Sep 29 '09 at 20:08

FModa3

14,409
1
17
10

5

I think he is talking about fixed width encoding. – Byron Whitlock Sep 29 '09 at 20:11
Used this method - works great, output: Time: 1253592000 Livereporter: Span: Bcreporter: Time: 1253678400 Livereporter: 86400 Span: 6183.000000 Bcreporter: Time: 1253764800 Livereporter: 86400 Span: 4486.000000 Bcreporter: Time: 1253851200 Livereporter: 36.000000 Span: 86400 Bcreporter: 10669.000000 Time: 1253937600 Livereporter: 0.000000 Span: 86400 Bcreporter: 9126.000000 Time: 1254024000 Livereporter: 0.000000 Span: 86400 Bcreporter: 2930.000000 Time: 1254110400 Livereporter: 0.000000 Span: 86400 Bcreporter: 2895.000000 Time: 1254196800 Livereporter: 0.000000 Span: 8828.000000 – Greg Sep 29 '09 at 20:36
1

You can't split on whitespace because some fields are empty. You lose the column order when you do this. – brian d foy Sep 29 '09 at 23:29
I'm going to test out the unpack solution in a few minutes. - Thanks! – Greg Sep 30 '09 at 12:37

P Shved · Answer 5 · 2009-09-30T16:58:06.347

-1

If all fields have the same fixed width and are formatted with spaces, you can use the following split:

@array = split / {1,N}/, $line;

where N is the with of the field. This will yield a space for each empty field.

edited Sep 30 '09 at 16:58

answered Sep 29 '09 at 20:18

P Shved

96,026
17
121
165

I don't think that does what you think it does. There are two major errors in that one line: one in logic and one in syntax. – brian d foy Sep 29 '09 at 23:27
@brian d foy: thank you, fixed. Sorry for a low-quality answer. Anyway, `unpack` solution is way better. – P Shved Sep 30 '09 at 03:50

score -2 · Answer 6 · answered Sep 29 '09 at 20:10

-2

Fixed width delimiting can be done like this:

my @cols;
my %header;
$header{field1} = 0; // char position of first char in field
$header{field2} = 12;
$header{field3} = 15;

while(<IN>) {

   print chomp(substr $_, $header{field2}, $header{field3}); // value of field2 


}

My Perl is very rusty so I am sure there are syntax errors there. but that is the gist of it.

answered Sep 29 '09 at 20:10

Byron Whitlock

52,691
28
123
168

1

Why are you chomping like that? And what do you think that prints? See the documentation for chomp for a clue. Not to be too mean about it, but if you're guessing and showing something you've never tried or even run, you should wait for a more experienced person to answer. – brian d foy Sep 29 '09 at 23:28

How can I extract columns from a fixed-width format in Perl?

6 Answers6

Linked