combine keys of hashes for output (outer join of hashes)

Question

I'm analysing a log file with Perl 5.8.8.^[1] I'm looking for days on which one or both of two trigger patterns occur (I changed the actual patterns in the code snippet shown below). I'm interested in the number of occurrences per day. The next step will be to make a spreadsheet from it, which is why the output is formatted with tabs.

Because only one of the patterns may occur on a given day, I need a way to combine the keys of both hashes. I did it by generating a new hash. Is there a built-in function for that? I searched the web and Stack Overflow without any result; the only hit I got here was "Build a string from 2 hashes", but in that case the key sets were identical.

#!/usr/bin/perl -w
use strict;
use warnings;
use locale;

# input analysis: searching for two patterns:
my %pattern_a = ();
my %pattern_b = ();
foreach my $line (<>) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        if ($what =~ m/beendet/) {
            $pattern_a{$day} ++;
        } elsif ($what =~ m/ohne/) {
            $pattern_b{$day} ++;
        }
    }
}

# generate the union of hash keys:        <-- In Question
my %union = ();
$union{$_} = 1 for keys %pattern_a;
$union{$_} = 1 for keys %pattern_b;

# formatted output sorted by day:
foreach my $day (sort keys %union) {
    print join "\t", $day, 
            ($pattern_a{$day} || 0), 
            ($pattern_b{$day} || 0)."\n";
}

The expected output would look like this:

2017-02-01      0       1
2017-02-18      0       592
2017-02-19      2       0

^[1] I'm aware that this Perl version is quite outdated. I use Perl only rarely, and when I do, it has to go fast, so figuring out Perl versions and the like gets done later. The Perl version should not matter much for the actual question, at least I hope so...

Don't name stuff `$a` and `$b`. Those variables are reserved for use in `sort` blocks. What exactly do you mean by combine? Can you show how you want the final output to be? I think you picked a complicated approach. This can be simplified, but we need to see what you are going for. The structure of the data that works best depends on the end-product. — simbabque, Feb 23 '17 at 11:10
@simbabque of course, $a and $b are bad, fixed, thanks for pointing this out. – Wolf, Feb 23 '17 at 11:16
I'm going to sound like a nitpicker, but all-caps variable names are not better. They look like constants. Besides, single-letter variable names are not descriptive, so it's hard to guess what they are for. Why not rewrite that part as `say join "\t", $day, $pattern_a{$day} // 0, $pattern_b{$day} // 0;` and then you don't need that var any more. You also don't need those parentheses `()` because you used the low-precedence `or` instead of `||`, which is kind of wrong in this case, because it will also be false if there's already a `0` in that variable anyway. `//` is _defined-or_, which is better. – simbabque, Feb 23 '17 at 11:21
@simbabque thanks for the suggestion. I had problems with `//` that I don't understand, maybe the code for output is now better. — Wolf, Feb 23 '17 at 11:31

melpomene · Answer 1 · 2017-02-23T11:44:14.223

2

Wouldn't it be easier to use a single hash?

#!/usr/bin/perl
use strict;
use warnings;

my %stats;

while (my $line = readline) {
    my ($day, $pattern) = $line =~ /^(\d{4}-\d{2}-\d{2}).*(beendet|ohne)/
        or next;

    $stats{$day}{$pattern}++;
}

for my $day (sort keys %stats) {
    printf "%s\t%d\t%d\n",
        $day,
        $stats{$day}{beendet} // 0,
        $stats{$day}{ohne} // 0;
}

If you're using a perl before 5.10, replace // by ||; it makes no effective difference in this case. (But consider upgrading: 5.8.8 is from 2006. It's now more than a decade old. The officially maintained perl versions are 5.22 (2015) and 5.24 (2016).)
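To illustrate the difference between the two operators, here is a minimal sketch. It needs perl 5.10 or later for //, and the %count hash with its keys is made up for the example. The operators only disagree when a key exists with the value 0, which a counter that is only ever incremented with ++ never has:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical counters: one positive, one explicitly 0, one missing.
my %count = ( seen => 3, zero => 0 );

my @rows;
for my $key (qw(seen zero missing)) {
    my $with_or      = $count{$key} || 'n/a';    # any false value is replaced
    my $with_defined = $count{$key} // 'n/a';    # only undef is replaced
    push @rows, "$key\t$with_or\t$with_defined";
}
print "$_\n" for @rows;
```

Only the "zero" row differs: || replaces the stored 0 with the fallback, while // keeps it.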

edited Feb 23 '17 at 11:44

answered Feb 23 '17 at 11:25

melpomene


*`easier to use a single hash`* -- Yes maybe, but I'm using Perl only from time to time, so I first have to get used to it again... – Wolf Feb 23 '17 at 11:37
Would you say the question is asked wrong when using Perl? Is that maybe the reason I found nothing? I mean, the output *is* associative data and that's what hashes are for... (And yes, I'm looking for a newer Perl version right now) – Wolf Feb 23 '17 at 12:02
...this runs slower than simbabque's first solution, even if I cascade matching the date and matching `/(beendet|ohne)/` – Wolf Feb 23 '17 at 13:22
having the patterns concentrated seems to be good (but how to avoid the repetition for the output?) – Wolf Feb 23 '17 at 13:25

simbabque · Accepted Answer · 2017-02-23T13:41:41.447

It's easier to structure your data first by day, then by pattern. That can be done using a hash reference.

use strict;
use warnings;

my %matches;
while ( my $line = <DATA> ) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        if ($what =~ m/beendet/) {
            $matches{$day}->{a} ++;
        } elsif ($what =~ m/ohne/) {
            $matches{$day}->{b} ++;
        }
    }
}

# formatted output sorted by day:
foreach my $day (sort keys %matches) {
    print join(
        "\t",
        $day,
        $matches{$day}->{a} || 0,
        $matches{$day}->{b} || 0,
    ), "\n";
}

__DATA__
2017-02-01 einmal Pommes ohne
2017-02-02 Wartung gestartet
2017-02-02 Wartung beendet
2017-02-03 ohne Moos nix los

That program produces output as follows

2017-02-01  0   1
2017-02-02  1   0
2017-02-03  0   1

To understand the data structure, you can use Data::Dumper to output it (though I suggest using Data::Printer instead, as that's intended for human consumption and not as a serialization).

use Data::Dumper;
print Dumper \%matches;
__END__

$VAR1 = {
          '2017-02-03' => {
                            'b' => 1
                          },
          '2017-02-02' => {
                            'a' => 1
                          },
          '2017-02-01' => {
                            'b' => 1
                          }
        };

As you can see, the data is structured first by date. Each key represents one day. Inside, there is an additional hash reference that holds only one key here: the pattern that matched on that day. In the output loop we iterate over the days first. Then we get

{
    'b' => 1
}

in the first iteration. Then we iterate all the patterns. The above program does this not by actually iterating, but by explicitly stating each possible key. If the key is there, its count is used. If it's missing, the || operator substitutes 0 in the output; the hash itself is not changed.
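As a cross-check of this structure: the keys of the outer hash are exactly the union of days that the question builds by hand. The following is only a sketch with made-up sample data. It also shows a hash slice as a compact way to build that union if the two separate hashes have to stay:

```perl
use strict;
use warnings;

# Hypothetical per-pattern counters, shaped like the two hashes in the question.
my %pattern_a = ( '2017-02-02' => 1 );
my %pattern_b = ( '2017-02-01' => 1, '2017-02-03' => 1 );

# Union of the keys via a hash slice; the values stay undef, only the keys matter.
my %union;
@union{ keys %pattern_a, keys %pattern_b } = ();

# The same counts keyed by day first, then by pattern.
my %matches = (
    '2017-02-01' => { b => 1 },
    '2017-02-02' => { a => 1 },
    '2017-02-03' => { b => 1 },
);

my @from_union  = sort keys %union;
my @from_nested = sort keys %matches;
print "@from_union\n";
print "@from_nested\n";
```

Both lines list the same three days, so with the nested structure no separate union step is needed.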

The program can be generalized further to use arbitrary patterns. If you don't care about the order of the patterns in the output, include a header and you can easily add more patterns later.

I used a config hash for the patterns, and Text::Table to create the output.

use strict;
use warnings;
use Text::Table;

my %matches;
my %patterns = (
    beendet => qr/beendet/,
    ohne    => qr/ohne/,
    komplex => qr/foo\sbar?/, # or whatever
);
while ( my $line = <DATA> ) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        foreach my $name ( sort keys %patterns ) {
            if ( $what =~ $patterns{$name} ) {
                $matches{$day}->{$name}++ ;
                last;
            }
        }
    }
}

# formatted output sorted by day:
my @head = sort keys %patterns;
my $tb = Text::Table->new( 'Tag', @head );

foreach my $day (sort keys %matches) {
    $tb->load([ $day, map { $matches{$day}->{$_} || 0 } @head ]);
}

print $tb;

__DATA__
2017-02-01 einmal Pommes ohne
2017-02-02 Wartung gestartet
2017-02-02 Wartung beendet
2017-02-03 ohne Moos nix los

This prints

Tag        beendet komplex ohne
2017-02-01 0       0       1   
2017-02-02 1       0       0   
2017-02-03 0       0       1

If you don't want to install an additional module, maybe just create a CSV file. Since you're from Germany, I suggest a semicolon ; as the separator, because German Excel uses that as the default.

Here is a verbose example of how to do this instead of Text::Table.

my @head = sort keys %patterns;
# label the day column too, so the header has as many fields as the rows
print join( ';', 'Tag', @head ), "\n";
foreach my $day (sort keys %matches) {
    my @cols;
    push @cols, $matches{$day}->{$_} || 0 for @head;
    print join ';', $day, @cols;
    print "\n";
}

And the output is

Tag;beendet;komplex;ohne
2017-02-01;0;0;1
2017-02-02;1;0;0
2017-02-03;0;0;1

But you should also look into Text::CSV if you don't want this to go to the screen.
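A rough sketch of what that could look like, assuming Text::CSV is installed from CPAN (it is not a core module). The %matches data and the 'Tag' column label are copied from the examples above for illustration. combine and string build one line at a time, which can then be printed to any filehandle:

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, assumed to be installed

# Sample data in the shape produced by the program above.
my %matches = (
    '2017-02-01' => { ohne    => 1 },
    '2017-02-02' => { beendet => 1 },
    '2017-02-03' => { ohne    => 1 },
);
my @head = qw(beendet komplex ohne);

my $csv = Text::CSV->new({ binary => 1, sep_char => ';' })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Header row first, then one row per day with 0 for missing counts.
my @rows = ( [ 'Tag', @head ] );
foreach my $day (sort keys %matches) {
    push @rows, [ $day, map { $matches{$day}->{$_} || 0 } @head ];
}

my @lines;
foreach my $row (@rows) {
    $csv->combine(@$row) or die "Could not combine row for output";
    push @lines, $csv->string;
}
print "$_\n" for @lines;
```

The module takes care of quoting, which only starts to matter once a field contains the separator, a quote character or a line break.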

Sorry I did not mention that I'm using Perl 5(.8.8), maybe that's why I'm having difficulties with `say` and `//`? — Wolf, Feb 23 '17 at 11:40
That `each %patterns` code looks broken. If `$pattern` matches, the loop is aborted with `last`, which leaves the `%patterns` iterator somewhere in the middle. The next `$line` will then only try the remaining patterns. — melpomene, Feb 23 '17 at 11:41
...thanks for mentioning `__DATA__` I'm restarting with Perl once a year ;) — Wolf, Feb 23 '17 at 11:41
@melpomene hmm that's true. I keep forgetting about that with `each`. I'll change it to something else. — simbabque, Feb 23 '17 at 11:43
@Wolf well if you have 5.8.8 you're missing out on a lot of good stuff. The `//` is not there indeed. But you can just use `||` instead. `say` is like `print` with a newline attached. Please include that detail into the question. — simbabque, Feb 23 '17 at 11:44
I've fixed the `each` problem and gotten rid of the newer features. — simbabque, Feb 23 '17 at 11:48
I'm going to do the addition. Would you tell me please if 'say' and '//' is available in Perl 5? I'm using ActiveState Perl and obviously have to switch to Strawberry soon. — Wolf, Feb 23 '17 at 11:49
@Wolf why do you think you need to switch? I figured if you have 5.8.8 you have some old server that came with an ancient Perl. If you're on Windows, ActivePerl has a very recent version, but you need to buy a licence to use it commercially. Both [`say`](http://perldoc.perl.org/perl5100delta.html#say()) and [`//`](http://perldoc.perl.org/perl5100delta.html#Defined-or-operator) are available from 5.10. It doesn't matter which Perl distribution for Windows you use, but keeping up to date has advantages. :) – simbabque, Feb 23 '17 at 11:59
The suggestion of extracting the patterns is indeed great. But since I have difficulties getting `Text::Table` installed, wouldn't it be possible to get the desired output without it? I mean I'm fine with just the *whatever*-separated data (as LibreOffice is able to import from a variety of formats), so `Text::CSV` would be overkill as well. – Wolf, Feb 23 '17 at 12:58
@Wolf In that case, just use a semicolon `;`. That's the typical CSV separator in German-speaking countries and Excel likes it. It's also safer than whitespace. Take the `join` from the first part of my answer and the loop plus the rest from the second part. – simbabque, Feb 23 '17 at 13:02
I'd suggest using `print join( "\t", "Datum", @head ), "\n";` and `print join( "\t", $day, map { $matches{$day}->{$_} || 0 } @head), "\n";` the multi-line version looks more complicated than the version using `map`. — Wolf, Feb 23 '17 at 14:44
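The `each` pitfall raised in the comments above can be reproduced with a small sketch. The hash contents are placeholders; only the iterator behaviour matters. Leaving an `each` loop with `last` keeps the hash's internal iterator where it stopped, so the next loop over the same hash only sees the remaining pairs unless the iterator is reset first:

```perl
use strict;
use warnings;

# Stand-in for the config hash; the values do not matter here.
my %patterns = ( beendet => 1, ohne => 1, komplex => 1 );

# Leave an each() loop early, as `last` on the first match would.
while ( my ($name) = each %patterns ) {
    last;
}

# The next each() loop resumes after the pair that was already handed out.
my $after_last = 0;
while ( my ($name) = each %patterns ) {
    $after_last++;
}

# Same early exit again, but this time reset the iterator before looping.
while ( my ($name) = each %patterns ) {
    last;
}
keys %patterns;    # keys in void context resets the iterator
my $after_reset = 0;
while ( my ($name) = each %patterns ) {
    $after_reset++;
}

print "after last: $after_last, after reset: $after_reset\n";
```

This prints "after last: 2, after reset: 3". Looping with foreach over sort keys, as the answer now does, avoids the problem because it never touches the iterator mid-way.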
