7

I have a string of words; let's call them bad:

bad = "foo bar baz"

I can keep this string as a whitespace separated string, or as a list:

bad = bad.split(" ");

If I have another string, like so:

str = "This is my first foo string"

What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?

#Find if a word is there
bad.split(" ").each do |word|
  found = str.include?(word)
end

#Remove the word
bad.split(" ").each do |word|
  str.gsub!(/#{word}/, "")
end
Mike Trpcic
  • 25,305
  • 8
  • 78
  • 114

8 Answers8

9

If the list of bad words gets huge, a hash is a lot faster:

    require 'benchmark'

    bad = ('aaa'..'zzz').to_a    # 17576 words
    str= "What's the fasted way to check if any word from the bad string is within my "
    str += "comparison string, and what's the fastest way to remove said word if it's "
    str += "found" 
    str *= 10

    badex = /\b(#{bad.join('|')})\b/i

    bad_hash = {}
    bad.each{|w| bad_hash[w] = true}

    n = 10
    Benchmark.bm(10) do |x|

      x.report('regex:') {n.times do 
        str.gsub(badex,'').squeeze(' ')
      end}

      x.report('hash:') {n.times do
        str.gsub(/\b\w+\b/){|word| bad_hash[word] ? '': word}.squeeze(' ')
      end}

    end
                user     system      total        real
regex:     10.485000   0.000000  10.485000 ( 13.312500)
hash:       0.000000   0.000000   0.000000 (  0.000000)
steenslag
  • 79,051
  • 16
  • 138
  • 171
  • I get a `regular expression too big` error with this copypasted. – Arkku Mar 31 '10 at 11:54
  • (Of course, that's yet another reason to go for the hash solution. =) – Arkku Mar 31 '10 at 12:03
  • I had a similar task a couple years ago, and spent a couple days searching out lists of known offensive words and terms. There were 479 English words considered offensive after I boiled it down. For every letter in our alphabet there are at least two l33t substitutions. Precomputing alternate spellings for every word plus its l33t variations put me into millions of alternatives. Trying to match a text block against that was not going to be fast enough for real-time cleaning. I finally got my boss to shelve it but he passed it to a Jr. developer who tried doing it without the l33t. Easy to beat – the Tin Man Mar 31 '10 at 22:30
  • Some examples of how bad this could get if you (or a boss) want to get anal enough: foo, f0o, fo0, f00, ƒoo, ƒo0, ƒ00, phoo, ph0o, pho0, ph00, Foo, FOo, FOO, Fo0, F0o, F00. Factoring out the case differences is pretty easy but the variations of spelling still create a big list. – the Tin Man Mar 31 '10 at 23:00
3

bad = "foo bar baz"

=> "foo bar baz"

str = "This is my first foo string"

=> "This is my first foo string"

(str.split(' ') - bad.split(' ')).join(' ')

=> "This is my first string"

jeem
  • 919
  • 7
  • 7
  • This is concise to read, but how fast does it perform compared to a purely regex approach? I need high performance here, as the list of "bad" words can be HUGE – Mike Trpcic Mar 31 '10 at 03:06
  • I like the conciseness of it but it doesn't do as well as the regex in benchmarking. – the Tin Man Mar 31 '10 at 04:40
1

All the solutions have problems with catching the bad words if the case does not match. The regex solution is easiest to fix by adding the ignore-case flag:

badex = /\b(#{bad.split.join('|')})\b/i

In addition, using "String".include?(" String ") will lead to boundary problems with the first and last words in the string or strings where the target words have punctuation or are hyphenated. Testing for those situations will result in a lot of other code being needed. Because of that I think the regex solution is the best one. It is not the fastest but it is going to be more flexible right out of the box, and, if the other algorithms are tweaked to handle case folding and compound-words the regex solution might pull ahead.

#!/usr/bin/ruby

require 'benchmark'

bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10

n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex:') do 
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end

  x.report('regex with squeeze:') do 
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end

  x.report('array subtraction') do
    n.times { (str.split(' ') - bad.split(' ')).join(' ') }
  end
end

I made the str variable a lot longer, to make the routines work a bit harder.

                          user     system      total        real
regex:                0.740000   0.010000   0.750000 (  0.752846)
regex with squeeze:   0.570000   0.000000   0.570000 (  0.581304)
array subtraction     1.430000   0.010000   1.440000 (  1.449578)

Doh!, I'm too used to how other languages handle their benchmarks. Now I got it working and looking better!

Just a little comment about what it looks like the OP is trying to do: Black-listed word removal is easy to fool, and a pain to keep maintained. L33t-sp34k makes it trivial to sneek words through. Depending on the application, people will consider it a game to find ways to push offensive words past the filtering. The best solution I found when I was asked to work on this, was to create a generator that would create all the variations on a word and dump them into a database where some process could check as soon as possible, rather than in real time. A million small strings being checked can take a while if you are searching through a long list of offensive words; I'm sure we could come up with quite a list of things that someone would find offensive, but that's an exercise for a different day.

I haven't seen anything similar in Ruby to Perl's Regexp::Assemble, but that was a good way to go after this sort of problem. You can pass an array of words, plus options for case-folding and word-boundaries, and it will spit out a regex pattern that will match all the words, with their commonalities considered to result in the smallest pattern that will match all words in the list. The problem after that is locating which word in the original string matched the hits found by the pattern, so they can be removed. Differences in word case and hits within compound-words makes that replacement more interesting.

And we won't even go into words that are benign or offensive depending on the context.


I added a bit more comprehensive test for the array-subtraction benchmark, to fit how it would need to work in a real piece of code. The if clause is specified in the answer, this now reflects it:

#!/usr/bin/env ruby

require 'benchmark'

bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10

str_split = str.split
bad_split = bad.split

n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex') do 
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end

  x.report('regex with squeeze') do 
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end

  x.report('bad.any?') do
    n.times { 
      if (bad_split.any? { |bw| str.include?(bw) })
        (str_split - bad_split).join(' ')
      end
    }
  end

  x.report('array subtraction') do
    n.times { (str_split - bad_split).join(' ') }
  end

end

with two test runs:

ruby test.rb 
                          user     system      total        real
regex                 1.000000   0.010000   1.010000 (  1.001093)
regex with squeeze    0.870000   0.000000   0.870000 (  0.873224)
bad.any?              1.760000   0.000000   1.760000 (  1.762195)
array subtraction     1.350000   0.000000   1.350000 (  1.346043)

ruby test.rb 
                          user     system      total        real
regex                 1.000000   0.010000   1.010000 (  1.004365)
regex with squeeze    0.870000   0.000000   0.870000 (  0.868525)
bad.any?              1.770000   0.000000   1.770000 (  1.775567)
array subtraction     1.360000   0.000000   1.360000 (  1.359100)
the Tin Man
  • 158,662
  • 42
  • 215
  • 303
0

I usually make a point of not optimizing without measurements, but here's a wag:

To make it fast, you should iterate through each string once. You want to avoid a loop with bad count * str count inner compares. So, you could build a big regexp and gsub with it.

(adding foo variants to test word boundary works)

str = "This is my first foo fooo ofoo string"

=> "This is my first foo fooo ofoo string"

badex = /\b(#{bad.split.join('|')})\b/

=> /\b(foo|bar|baz)\b/

str.gsub(badex,'').gsub('  ',' ')

=> "This is my first fooo ofoo string"

Of course the huge resulting regexp might be as slow as the implied nested iteration in my other answer. Only way to know is to measure.

jeem
  • 919
  • 7
  • 7
0
bad = %w(foo bar baz)
str = "This is my first foo string"

# find the first word in the list
found = bad.find {|word| str.include?(word)}

# remove it
str[found] = ''  ;# str => "This is my first  string"
glenn jackman
  • 238,783
  • 38
  • 220
  • 352
  • This will fail if the string contains more than one occurrence of a bad word, because `str[found]` will only find the first occurrence. It needs to loop until all the bad words are found. – the Tin Man Jan 07 '12 at 20:24
0

I'd benchmark this:

bad = "foo bar baz".split(' ')
str = "This is my first foo string".split(' ')

# 1. What's the fasted way to check if any word from the bad string is within my comparison string
p bad.any? { |bw| str.include?(bw) }

# 2. What's the fastest way to remove said word if it's found?
p (str - bad).join(' ')

any? will quick checking as soon as it sees a match. If you can order your bad words by their probability, you can save some cycles.

Levi
  • 4,628
  • 1
  • 18
  • 15
0

Here's one that will check for words and phrases.

 def checkContent(str)
     bad = ["foo", "bar", "this place sucks", "or whatever"]

     # may be best to map and singularize everything as well. 
     # maybe add some regex to catch those pesky, "How i make $69 dollars each second online..."
     # maybe apply some comparison stuff to check for weird characters in those pesky, "How i m4ke $69 $ollars an hour"


     bad_hash = {}
     bad_phrase_hash = {}

     bad.map(&:downcase).each do |word|
         words = word.split().map(&:downcase)
         if words.length > 1
             words.each do |inner|
                if bad_hash.key?(inner)
                    if bad_hash[inner].is_a?(Hash) && !bad_hash[inner].key?(words.length)
                         bad_hash[inner][words.length] = true
                    elsif bad_hash[inner] === 1
                        bad_hash[inner] = {1=>true,words.length => true}
                    end
                else
                    bad_hash[inner] = {words.length => true}
                end
             end
             bad_phrase_hash[word] = true
         else
             bad_hash[word] = 1
         end
     end

     string = str.split().map(&:downcase)
     string.each_with_index do |word,index|
        if bad_hash.key?(word)
            if bad_hash[word].is_a?(Hash)
                if bad_hash[word].key?(1)
                    return false
                else
                    bad_hash[word].keys.sort.each do |length|
                        value = string[index...(index + length)].join(" ")
                        if bad_phrase_hash.key?(value)
                            return false
                        end
                    end
                end
            else
                return false
            end
        end
     end
     return true
  end
Jack Rothrock
  • 407
  • 1
  • 8
  • 21
-2

The include? method is what you need. The ruby String specificacion says:

str.include?( string ) -> true or false Returns true if str contains the given string or character.

"hello".include? "lo" -> true

"hello".include? "ol" -> false

"hello".include? ?h -> true

Note that it has O(n) and what you purposed is O(n^2)

z-index
  • 409
  • 8
  • 14
  • Won't this give a lot of false positives? `Assessment.include? "ass"` => true (hopefully ass is not a disallowed word or something like that) – SeanJA Mar 31 '10 at 03:35
  • If you look at my code examples, I am using .include? to check for the existence of a word, but am looking for a faster way to do it than looping through a bunch of words (which will get slow). include also doesn't help me remove that word. – Mike Trpcic Mar 31 '10 at 03:55
  • String include? will give false positives. – Levi Mar 31 '10 at 18:51
  • `'Assessment'.include? "ass" # => false`, not `true`. `'Assessment'.downcase.include? "ass" # => true` – the Tin Man Jan 07 '12 at 19:56