4

I have a keywords list and a blacklist. I want to delete every keyword that contains any blacklist item. At the moment I'm doing it this way:

my @keywords = ( 'some good keyword', 'some other good keyword', 'some bad keyword');
my @blacklist = ( 'bad' );

A: for my $keyword ( @keywords ) {
    B: for my $bl ( @blacklist ) {
        next A if $keyword =~ /$bl/i;      # omitting $keyword
    }
    # some keyword cleaning (for instance: erasing non a-zA-Z0-9 characters, etc)
}

I was wondering whether there is a faster way to do this, because at the moment I have about 25 million keywords and a couple of hundred words in the blacklist.

gib

3 Answers

4

The most straightforward option is to join the blacklist entries into a single regular expression, then grep the keyword list for those which don't match that regex:

#!/usr/bin/env perl    

use strict;
use warnings;
use 5.010;

my @keywords = 
  ('some good keyword', 'some other good keyword', 'some bad keyword');
my @blacklist = ('bad');

my $re = join '|', @blacklist;
my @good = grep { $_ !~ /$re/ } @keywords;

say join "\n", @good;

Output:

some good keyword
some other good keyword
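
The question's loop matched case-insensitively (`/$bl/i`), while the joined pattern above is case-sensitive and is interpolated as-is. A minimal sketch of one way to keep the original behaviour, assuming the blacklist entries are plain words rather than regex fragments and that the blacklist is not empty (an empty alternation would match every keyword); the extra `'c++'` entry and the upper-case `'BAD'` are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @keywords  = ('some good keyword', 'some other good keyword', 'some BAD keyword');
my @blacklist = ('bad', 'c++');    # 'c++' is a hypothetical entry with regex metacharacters

# Escape each entry so it is matched literally, then compile the
# alternation once, case-insensitively, instead of on every match.
my $alternation = join '|', map { quotemeta($_) } @blacklist;
my $re          = qr/$alternation/i;

# Keep only the keywords that contain no blacklist entry.
my @good = grep { $_ !~ $re } @keywords;

print "$_\n" for @good;
```

If the blacklist entries are meant to be regular expressions themselves, as in the question, the `quotemeta` call would have to be dropped.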
Dave Sherohman

Thanks a lot! For a test with 50k keywords, execution time went down from 34sec to 0,6sec – gib May 24 '13 at 12:16

https://metacpan.org/module/Regexp::Assemble - Regexp::Assemble improves performance more. – Oesor May 24 '13 at 14:13

To demonstrate: perl -MData::Printer -MRegexp::Assemble -E "my $ra = Regexp::Assemble->new(); for my $word (qw/apple asp application aspire applicate aardvark snake/) { $ra->add($word) } p($ra->re);" gives (?:a(?:ppl(?:icat(?:ion|e)|e)|sp(?:ire)?|ardvark)|snake) – Oesor May 24 '13 at 14:19

3

Precompiling the patterns may help if you want to keep the nested loops: my @blacklist = ( qr/bad/i ).

Alternatively, change my @blacklist = ( 'bad', 'awful', 'worst' ) to my $blacklist = qr/bad|awful|worst/i; and replace the inner loop with if ( $keywords[$i] =~ $blacklist ) ....
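
For the first option, a rough sketch of what the nested loops could look like with precompiled patterns. The `KEYWORD` label and the `@kept` array are names chosen here for illustration, not taken from the question:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @keywords = ('some good keyword', 'some other good keyword', 'some bad keyword');

# Compile every blacklist entry once, up front, instead of
# recompiling /$bl/i for each keyword/blacklist pair.
my @blacklist = map { qr/$_/i } ('bad', 'awful', 'worst');

my @kept;
KEYWORD: for my $keyword (@keywords) {
    for my $bl (@blacklist) {
        # Skip this keyword as soon as any blacklist pattern matches.
        next KEYWORD if $keyword =~ $bl;
    }
    # Any further keyword cleaning would go here.
    push @kept, $keyword;
}

print "$_\n" for @kept;
```

This keeps the structure of the original loop, so per-keyword cleaning can stay where it was; the single combined regex from the second option avoids the inner loop altogether.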

AdrianHHH
0

This should do it:

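# Collect the index of every keyword that matches a blacklist entry.
my @indices;
for my $i (0..$#keywords) {
  for my $bl (@blacklist) {
    if ($keywords[$i] =~ /$bl/i) {
      push(@indices, $i);
      last;
    }
  }
}
# Remove from the end so the earlier indices stay valid;
# splice with a length of 1 deletes just that one element in place.
for my $i (reverse @indices) {
  splice(@keywords, $i, 1);
}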
dlaehnemann