
I’m trying to filter an array of terms using another array in Perl. I have Perl 5.18.2 on OS X, though the behavior is the same if I use 5.010. Here’s my basic setup:

#!/usr/bin/perl
#use strict;
my @terms = ('alpha','beta test','gamma','delta quadrant','epsilon',
             'zeta','eta','theta chi','one iota','kappa');
my @filters = ('beta','gamma','epsilon','iota');
foreach $filter (@filters) {
    for my $ind (0 .. $#terms) {
        if (grep { /$filter/ } $terms[$ind]) {
            splice @terms,$ind,1;
        }
    }
}

This works to pull out the lines that match the various search terms, but the array length doesn’t change. If I write out the resulting @terms array, I get:

[alpha]
[delta quadrant]
[zeta]
[eta]
[theta chi]
[kappa]
[]
[]
[]
[]

As you might expect from that, printing scalar(@terms) gets a result of 10.

What I want is a resulting array of length 6, without the four blank items at the end. How do I get that result? And why isn’t the array shrinking, given that the perldoc page about splice says, “The array grows or shrinks as necessary.”?

(I’m not very fluent in Perl, so if you’re thinking “Why don’t you just...?”, it’s almost certainly because I don’t know about it or didn’t understand it when I heard about it.)

Eric A. Meyer
  • `grep` operates on arrays and returns matching elements. Maybe you mean `$terms[$ind] =~ /$filter/` to match a single one? – tadman Dec 11 '16 at 21:40
  • Yep, looks like that works as intended—thanks! I’m still confused about why the array didn’t shrink with what I was doing before. – Eric A. Meyer Dec 11 '16 at 21:48
  • It's always tricky to remove elements from an array you're actively iterating over. That shifts the offset by 1 each time you splice something out. – tadman Dec 11 '16 at 21:49
  • FWIW, [`use VERSION`](http://perldoc.perl.org/functions/use.html) only specifies the _minimum_ version needed; it doesn't emulate the Perl interpreter as it existed at that version. – Matt Jacob Dec 12 '16 at 00:05

2 Answers


You can always regenerate the array minus things you don't want. grep acts as a filter allowing you to decide which elements you want and which you don't:

#!/usr/bin/perl

use strict;

my @terms = ('alpha','beta test','gamma','delta quadrant','epsilon',
           'zeta','eta','theta chi','one iota','kappa');
my @filters = ('beta','gamma','epsilon','iota');

my %filter_exclusion = map { $_ => 1 } @filters;

my @filtered = grep { !$filter_exclusion{$_} } @terms;

print join(',', @filtered) . "\n";

It's pretty easy if you have a simple structure like %filter_exclusion on hand. Note that the hash lookup only removes terms that are exactly equal to one of the filters.

Update: If you want to allow arbitrary substring matches:

my $filter_exclusion = join '|', map quotemeta, @filters;

my @filtered = grep { !/$filter_exclusion/ } @terms;
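For readers unsure why the single-pattern version works, here is a minimal self-contained sketch of the same idea. The helper name `filter_terms` and the empty-filter guard are illustrative assumptions, not part of the original answer. The filters are escaped with `quotemeta` and joined with `|`, so the pattern matches a term if any one filter occurs in it as a literal substring; `grep` then keeps only the terms that do not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns the terms
# that contain none of the filter strings as a substring.
sub filter_terms {
    my ($terms, $filters) = @_;

    # quotemeta escapes regex metacharacters so each filter matches literally.
    my $pattern = join '|', map { quotemeta $_ } @$filters;

    # Assumption: with nothing to filter on, keep every term.
    return @$terms unless length $pattern;

    my $re = qr/$pattern/;              # compile the alternation once
    return grep { !/$re/ } @$terms;     # keep terms that do NOT match
}

my @terms = ('alpha', 'beta test', 'gamma', 'delta quadrant', 'epsilon',
             'zeta', 'eta', 'theta chi', 'one iota', 'kappa');
my @filters = ('beta', 'gamma', 'epsilon', 'iota');

my @filtered = filter_terms(\@terms, \@filters);
print join(',', @filtered), "\n";
print scalar(@filtered), "\n";
```

Because the result is a new list, nothing is removed from an array while it is being iterated, so the index-shifting and trailing-element problems from the question do not arise.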
ikegami
tadman
  • That one only partly works—it filters out `gamma` and `epsilon`, but not `beta test` or `one iota`. Useful to have on hand for future projects, though! – Eric A. Meyer Dec 11 '16 at 22:00
  • Added a version that tests arbitrary substrings. This one uses a regular expression again, but just one test per entry, not N tests. – tadman Dec 11 '16 at 22:07
  • Cool, thanks! That does indeed work. Mind you, I have no idea whatsoever how or why it works. – Eric A. Meyer Dec 12 '16 at 02:33
  • `grep` acts like a pass-fail filter on each element in `@terms` here, so for a given `$_` from `@terms` it tests if it matches that pattern or not. The pattern is just a regular expression that matches any one of them as substrings. – tadman Dec 12 '16 at 15:35

To see what's going on, print the contents of the array at each step. When you splice the array, it does shrink, but the range 0 .. $#terms is computed once, before anything is removed, so toward the end of the loop $ind points past the end of the array. When you use grep { ... } $array[ $too_large ], Perl needs to alias the non-existent element to $_ inside the grep block, so it creates an undef element in the array.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @terms = ('alpha', 'beta test', 'gamma', 'delta quadrant', 'epsilon',
             'zeta', 'eta', 'theta chi', 'one iota', 'kappa');
my @filters = qw( beta gamma epsilon iota );

for my $filter (@filters) {
    say $filter;
    for my $ind (0 .. $#terms) {
        if (grep { do {
            no warnings 'uninitialized';
            /$filter/
        } } $terms[$ind]
        ) {
            splice @terms, $ind, 1;
        }
        say "\t$ind\t", join ' ', map $_ || '-', @terms;
    }
}

If you used $terms[$ind] =~ /$filter/ instead of grep, you'd still get uninitialized warnings, but as there's no need to alias the element, it won't be created.
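The difference can be seen in isolation with a small sketch. The sub names and the three-element array are made up for illustration, and the sizes printed reflect the behavior described in this answer: the `grep` call aliases an out-of-range element and so extends the array, while the plain match only reads it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical demo: grep must alias $array[5] to $_, which creates it.
sub length_after_grep {
    my @array = (1, 2, 3);
    my @matches = grep { defined $_ } $array[5];
    return scalar @array;
}

# Hypothetical demo: a plain match only reads $array[5], creating nothing.
sub length_after_match {
    my @array = (1, 2, 3);
    no warnings 'uninitialized';        # silence the undef-in-match warning
    my $matched = $array[5] =~ /x/;
    return scalar @array;
}

print "after grep:  ", length_after_grep(),  "\n";
print "after match: ", length_after_match(), "\n";
```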

choroba
  • @ikegami: I don't see `gamma` in the output. Moreover, this is not a "fix", it should only demonstrate WHY and WHEN the trailing elements are created - therefore, they're still there. – choroba Dec 12 '16 at 08:19
  • @ikegami: If I `print "@terms"`, I see `alpha delta quadrant zeta eta theta chi kappa`. – choroba Dec 12 '16 at 12:53
  • Oh sorry, the bug happens if you start with `@terms = qw( gamma gamma kappa );`. The second gamma gets moved into `$terms[0]`, which isn't revisited. – ikegami Dec 12 '16 at 13:08
  • @ikegami: True, you're right. But I was just trying to explain why the undefs exist. – choroba Dec 12 '16 at 13:17
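
One way to avoid both the skipped-element case raised in these comments and the out-of-range indices, if the array really must be modified in place, is to walk the indices from the end. This is a sketch under that assumption, not something proposed in the thread, and `remove_matching_in_place` is a hypothetical name. Splicing at an index only shifts elements above it, and those have already been visited.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: removes, in place, every term containing a filter.
sub remove_matching_in_place {
    my ($terms, $filters) = @_;
    for my $filter (@$filters) {
        # Highest index first, so a splice never moves an unvisited element.
        for my $ind (reverse 0 .. $#$terms) {
            # \Q...\E makes the filter match as a literal substring.
            splice @$terms, $ind, 1 if $terms->[$ind] =~ /\Q$filter\E/;
        }
    }
    return;
}

my @terms = qw( gamma gamma kappa );
remove_matching_in_place(\@terms, ['gamma']);
print "@terms\n";    # prints "kappa"
```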