2

I'm writing a Perl script to run through and grab various data elements such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I can grab each line of this text file no problem.

I have working regex to grab each of those fields. Once I have the line in a variable, i.e. $line - how can I grab each of those fields and place them into their own variables even though they have different delimiters?

brian d foy
  • 129,424
  • 31
  • 207
  • 592
Greg
  • 2,221
  • 4
  • 19
  • 11
  • 1
    Could you be a bit more specific about what your delimiters are? – Paul Williams Sep 29 '09 at 20:05
  • @Paul: That's the problem. This output is being generated by an application (splunk). I can't set the delimiters, and they seem to be just two whitespace characters between each column - but There's whitespace that extends past that. – Greg Sep 29 '09 at 20:41
  • 3
    @Greg: This really doesn't look like a "delimited" file at all. These are fixed-width columns. FM's solution below is really the cleanest way to parse this. – Clinton Pierce Sep 29 '09 at 21:02
  • @Paul - Ok got it. Thanks. I just always refer to anything that's separated by something, delimited. But that makes a lot more sense. – Greg Sep 30 '09 at 12:35
  • I'm not sure I agree, that one is simple, this one implies that the fixed-width format is for a database or something more complex. – Evan Carroll Jul 15 '10 at 04:07

6 Answers6

14

This example illustrates how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack if you use upper-case (A10 etc), whitespace will be removed for you. Note: as brian d foy points out, the split approach does not work well for a situation with missing fields (for example, the second line of data), because the field position information will be lost; unpack is the way to go here, unless we are misunderstanding your data.

use strict;
use warnings;

while (my $line = <DATA>){
    chomp $line;
    my @fields_whitespace = split m'\s+', $line;
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
FMc
  • 41,963
  • 13
  • 79
  • 132
  • 3
    Everyone forgets that Perl has pack, but it's really handy and I should use it more myself. I was just editing that chapter for the next edition of Effective Perl Programming. :) – brian d foy Sep 29 '09 at 23:26
  • 1
    `split m'\s+'` would highlight better. – Brad Gilbert Sep 29 '09 at 23:56
  • According to the perldoc - "The string is broken into chunks described by the TEMPLATE." These chunks are inserted into the @fields_fixed array, correct? – Greg Sep 30 '09 at 13:53
  • while ($line = ) { if ($x < 2) { $x++; } else { chomp $line; @fields_whitespace = split m'\s+', $line; @fields_fixed = unpack('a10 a10 a12 a28', $line); print @fields_fixed, "\n"; $x++; } } That's what I have - I cannot access @fields_fixed outside the block - am I missing some basic ideal of programming that I should remember? I know and understand scope - but am confused in this case. – Greg Sep 30 '09 at 14:19
3

Use my module DataExtract::FixedWidth. It is the most full featured, and well tested, for working with Fixed Width columns in perl. If this isn't fast enough you can pass in an unpack_string and eliminate the need for heuristic detection of boundaries.

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

    --alternatively if you want header info--

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
Evan Carroll
  • 78,363
  • 46
  • 261
  • 468
0

I'm unsure of the column names and formatting but you should be able to adjust this recipe to your liking using Text::FixedWidth

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
-1

You can split the line. It appears that your delimiter is just whitespace? You can do something on the order of:

@line = split(" ", $line);

This will match all whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

Split can also take a regular expression rather than a string as a delimiter as well.

@line = split(/\s+/, $line);

This might do the same thing.

friedo
  • 65,762
  • 16
  • 114
  • 184
FModa3
  • 14,409
  • 1
  • 17
  • 10
  • 5
    I think he is talking about fixed width encoding. – Byron Whitlock Sep 29 '09 at 20:11
  • Used this method - works great, output: Time: 1253592000 Livereporter: Span: Bcreporter: Time: 1253678400 Livereporter: 86400 Span: 6183.000000 Bcreporter: Time: 1253764800 Livereporter: 86400 Span: 4486.000000 Bcreporter: Time: 1253851200 Livereporter: 36.000000 Span: 86400 Bcreporter: 10669.000000 Time: 1253937600 Livereporter: 0.000000 Span: 86400 Bcreporter: 9126.000000 Time: 1254024000 Livereporter: 0.000000 Span: 86400 Bcreporter: 2930.000000 Time: 1254110400 Livereporter: 0.000000 Span: 86400 Bcreporter: 2895.000000 Time: 1254196800 Livereporter: 0.000000 Span: 8828.000000 – Greg Sep 29 '09 at 20:36
  • 1
    You can't split on whitespace because some fields are empty. You lose the column order when you do this. – brian d foy Sep 29 '09 at 23:29
  • I'm going to test out the unpack solution in a few minutes. - Thanks! – Greg Sep 30 '09 at 12:37
-1

If all fields have the same fixed width and are formatted with spaces, you can use the following split:

@array = split / {1,N}/, $line;

where N is the with of the field. This will yield a space for each empty field.

P Shved
  • 96,026
  • 17
  • 121
  • 165
  • I don't think that does what you think it does. There are two major errors in that one line: one in logic and one in syntax. – brian d foy Sep 29 '09 at 23:27
  • @brian d foy: thank you, fixed. Sorry for a low-quality answer. Anyway, `unpack` solution is way better. – P Shved Sep 30 '09 at 03:50
-2

Fixed width delimiting can be done like this:

my @cols;
my %header;
$header{field1} = 0; // char position of first char in field
$header{field2} = 12;
$header{field3} = 15;

while(<IN>) {

   print chomp(substr $_, $header{field2}, $header{field3}); // value of field2 


}

My Perl is very rusty so I am sure there are syntax errors there. but that is the gist of it.

Byron Whitlock
  • 52,691
  • 28
  • 123
  • 168
  • 1
    Why are you chomping like that? And what do you think that prints? See the documentation for chomp for a clue. Not to be too mean about it, but if you're guessing and showing something you've never tried or even run, you should wait for a more experienced person to answer. – brian d foy Sep 29 '09 at 23:28