10

I have a list of around 300k common words stored in an array. So, 1 element of the array = 1 word.

On the other side, I have a huge list of strings that MAY contain one or more of these 300k words inside them. A sample string would be: ifdxawesome453.

Now, I need to check each of these long strings against the common words. If a word is found within that string, immediately return. So, I need to check the 300k words again ifdxawesome453 and see if any of them is contained within it.

So what I do is:

huge_list_of_words.any? do |word|
  random_long_word.include?(word)
end

While this is okay for small samples of random long words, if I have millions of them, suddenly it takes hours to finish the job.

Is there a way to do this faster? The only way I thought of is if I sample out, say 10k most common words out of these 300k and compare against it first, and if no match is found, compare against the full list.

Another way that drastically sped things up was to group the array of the 300k words by size. When I then compare the long random word against it, I first check if the word's size and filter out any longer words. I am then left with the indexes of the equal size or less words, and search against them, starting from the word with the smallest size.

anemaria20
  • 1,646
  • 2
  • 17
  • 36
  • Do the words in the `huge_list_of_words` follow a certain pattern? For example do they all have the same length, or a min or max length, or do they start or end with a certain character? What about the `random_long_word`? Is it one long word? Or is it a string of words separated by some kind of word boundary? Please post a small example. – spickermann Jan 07 '17 at 13:35
  • @spickermann no they're random. Same with random_long_word. – anemaria20 Jan 07 '17 at 13:48
  • Aho Corasick should be very fast, and significantly faster than a raw trie since it doesn't require restarting lookup from the root on each character of the input string. – Reinstate Monica Jan 07 '17 at 19:11
  • @Solomonoff'sSecret are you aware of any good library (for any language) that implements it well? I tested all Ruby libs for Aho Corasick , but they're either abandoned or very poor. – anemaria20 Jan 07 '17 at 19:58
  • @anemaria20 No. But I implemented it myself in Java - it's not very complex - and noticed a 40% speedup over a regular trie on roughly the same data size that you have. – Reinstate Monica Jan 08 '17 at 02:23

3 Answers3

5

Solution

Trie structure is a step in the right direction. SuffixTree might also help.

It looks like Triez gem has more features than Trie gem, but the documentation is far from being complete. :substring sounds perfect, but it seems you can only use it in change_all :

# gem install triez
require 'triez'

huge_list_of_words = Triez.new value_type: :object, default: nil

%w(awesome someword anotherword).each do |word|
  huge_list_of_words[word] = word
end

class String
  def contains_word_from_dict?(dict)
    dict.change_all(:substring, self) do |v|
      return v if v
    end
    nil
  end
end

'ifdxawesome45someword3'.contains_word_from_dict?(huge_list_of_words)
# => "awesome"
'ifdxawsome45someword3'.contains_word_from_dict?(huge_list_of_words)
# => "someword"
'ifdxawsome45sameword3'.contains_word_from_dict?(huge_list_of_words)
# => nil

Test

I tried it with a bigger dictionary (~100k words) and a million lookups :

huge_list_of_words = Triez.new value_type: :object, default: nil

dict = '/usr/share/dict/american-english'
File.foreach(dict) do |word|
  word.chomp!
  huge_list_of_words[word] = word if word.size > 4 # avoid finding `if` or `word`
end

1_000_000.times do
  'ifdxawesome45someword3'.contains_word_from_dict?(huge_list_of_words)
end

It returned after 22s on my slowish laptop.

To be very honest, I don't understand how change_all works, and what its purpose is. It does seem to work fine for your purpose though! ¯\_(ツ)_/¯

Eric Duminil
  • 52,989
  • 9
  • 71
  • 124
  • Sweet! Just it doesn't work on Windows when you require it, it can't load the .so file, too bad because I'm kind of forced to use for now. – anemaria20 Jan 07 '17 at 16:49
  • `.so` files? Those are linux dynamic libraries, it's clear they shouldn't work on Windows. You might need mingw or cygwin, but I'm not sure. Even a simple Trie class written in plain Ruby will be much faster than just checking every word one after the other. – Eric Duminil Jan 07 '17 at 18:19
  • @EricDuminil I see my RAM usage spiking up as a consequence of this :0 I've read it's the nature of the algo to consume memory. Searches for SuffixTree solutions but they're sparse, for some reason there's plenty of Trie libs on github. – anemaria20 Jan 07 '17 at 18:33
  • @EricDuminil can you compare Triez with rambling-trie? I tested rambling-trie (without compressed mode) vs. trie (from vidstige) and got 2x better performance (with 100k words tested against 300k word dictionary), 7 vs 18 seconds. – anemaria20 Jan 07 '17 at 19:04
  • @anemaria20 : Sorry, I've never tried rambling-trie. Do you have a working code for it? – Eric Duminil Jan 07 '17 at 21:15
  • @EricDuminil Here it is, pretty simple API: https://github.com/gonzedge/rambling-trie – anemaria20 Jan 09 '17 at 00:59
3

The fastesy way is to create a trie for all the words in the huge list. Then you need loop through all the long strings. For a particular string, do custom iteration to start searching the trie for each letter in the string to match all your words. You can terminate early if you only need to find one contained word for extra speed.

For the trie part there seems to be a sweet implementation in ruby here. It even has a example, very similar to your use case that you should be able to tweak.

word = 'forestry'
node = trie.root

word.split('').each do |char|
  break unless node.walk!(char)
  if node.terminal?
    puts "Found me a word: #{node.full_state}"
  end
end
vidstige
  • 12,492
  • 9
  • 66
  • 110
  • Nobody nows, you'd be the first to do it. :-) – vidstige Jan 07 '17 at 13:07
  • @anemaria20 I added a link to a trie implementation – vidstige Jan 07 '17 at 13:07
  • Rubinius's `Hash` class is also implemented as a trie (more specifically a Hash Array Mapped Trie aka HAMT) in pure Ruby (although using a couple of Rubinius primitives). – Jörg W Mittag Jan 07 '17 at 13:16
  • @vidstige I've tried and this works but only if the word is at the beginning of the string. I guess consdering the way binary search trees work, I'll first need to generate all possible combinations of the word, right...But then I'm not sure if the performance will stay as is. – anemaria20 Jan 07 '17 at 13:31
  • @anemaria20 : Yes, you'd need to iterate over substrings : `ifdxawesome453`, `fdxawesome453`, `dxawesome453`, `...`, which will impact the performance. – Eric Duminil Jan 07 '17 at 13:49
  • You only need to consider substrings, not permutations as I wrote in my answer. The answer above got a complete code sample and everything – vidstige Jan 07 '17 at 14:43
  • @vidstige as I understand, if 'heythere' is the string, then 'there' is considered a sub-string. The code above won't recognize 'there' as a sub-string as far as I'm aware. Please let me know if I'm wrong. – anemaria20 Jan 07 '17 at 15:43
  • no exactly, this is correct. That's why I wrote "start searching the trie for each letter in the string to match all your words". The piece of code above is just taken directly from the library I linked. That's why it needs some tinkering to get started. When I wrote answer above, I meant by @Eric Duminil :-) – vidstige Jan 07 '17 at 18:14
  • 1
    @ArieShaw took the liberty to make your code work for substrings as well ;) – Eric Duminil Jan 07 '17 at 18:16
3

A benchmark between the Trie gem and Triez gem in this particular use case:

word count: 228982
                   user     system      total        real
trie          13.410000   0.050000  13.460000 ( 13.463473)
triez         11.080000   0.010000  11.090000 ( 11.102195)
trie tail     39.920000   0.140000  40.060000 ( 40.102285)
triez tail    28.960000   0.030000  28.990000 ( 29.022630)

Generally, Triez is faster for the Op's use case.

require 'triez'
require 'trie'
require 'benchmark'

DICT = '/usr/share/dict/web2'

triez = Triez.new value_type: :object, default: nil
trie = Trie.new

count = 0
File.foreach(DICT) do |word|
  word.chomp!
  if word.size > 4
    triez[word] = word
    trie.add word
    count += 1
  end
end

puts "word count: #{count}"

def in_trie?(str, trie)
  0.upto(str.length - 1) do |i|
    node = trie.root
    i.upto(str.length - 1) do |j|
      break unless node.walk! str[j]
      if node.terminal?
        return str[i..j]
      end
    end
  end
  nil
end

def in_triez?(str, triez)
  triez.change_all(:substring, str) do |v|
    return v if v
  end
  nil
end

Benchmark.bm(12) do |b|
  b.report('trie') do
    1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
  end
  b.report('triez') do
    1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
  end
  b.report('trie tail') do
    1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
  end
  b.report('triez tail') do
    1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
  end
end

UPDATE benchmark for rambling-trie, where the lines with c prefix is the compressed version. (NOTE: the ROUND has been reduced to 100K rather than 1M in the prefix benchmark)

Word count: 228982, ROUND: 100000
                      user     system      total        real
trie              1.510000   0.000000   1.510000 (  1.511772)
triez             1.170000   0.000000   1.170000 (  1.176075)
rambling          4.800000   0.010000   4.810000 (  4.847021)
c rambling       25.060000   0.050000  25.110000 ( 25.172771)
trie tail         4.540000   0.010000   4.550000 (  4.566233)
triez tail        3.080000   0.010000   3.090000 (  3.092655)
rambling tail     4.780000   0.010000   4.790000 (  4.803114)
c rambling tail  23.470000   0.020000  23.490000 ( 23.525066)

It seems rambling-trie is implemented purely in Ruby, and it doesn't offer direct methods to do prefix matching. The following monkey patches need to be added first. There may be better implementation, but I didn't dig further.

class Rambling::Trie::Container
  def match_prefix?(str)
    root.match_prefix?(str.chars)
  end
end

class Rambling::Trie::RawNode
  def match_prefix?(chars, i = 0)
    if children_tree.empty?
      true
    elsif i >= chars.size
      false
    else
      letter = chars[i].to_sym
      child = children_tree[letter]
      !!child && child.match_prefix?(chars, i + 1)
    end
  end
end

class Rambling::Trie::CompressedNode
  def match_prefix?(chars)
    if children_tree.empty?
      true
    if chars.empty?
      false
    else
      !!(recursive_get :match_prefix?, chars)
    end
  end
end

def in_r_trie?(str, r_trie)
  0.upto(str.length - 1) do |i|
    if r_trie.match_prefix? str[i..-1]
      return true
    end
  end
  false
end
Arie Xiao
  • 13,909
  • 3
  • 31
  • 30
  • Thanks for the info. Do you understand how `change_all(:substring,str)` works? – Eric Duminil Jan 07 '17 at 16:34
  • Thanks @ArieShaw, can you also add github.com/gonzedge/rambling-trie to your comparison? They have a compressed trie version as well. – anemaria20 Jan 07 '17 at 18:34
  • @EricDuminil I didn't check the implementation. It should be similar: starting from each character position of the input string, iteratively check if the prefix is a word in the trie. If it is, remove that word and insert the given replacement. I may be wrong though. – Arie Xiao Jan 08 '17 at 04:24
  • That's an awesome benchmark @ArieShaw!! You are right to say that it's a pure Ruby implementation, so it's bound to be slower even with an optimal algorithm. – Edgar Gonzalez Jan 09 '17 at 01:27
  • Also, @ArieShaw, do you mind sending a pull request to [rambling-trie](https://github.com/gonzedge/rambling-trie) with this code? We have [an issue currently open for that](https://github.com/gonzedge/rambling-trie/issues/9). Any contributions are welcome :) – Edgar Gonzalez Jan 09 '17 at 01:30
  • @ArieShaw I implemented some similar changes to rambling-trie and made [an updated benchmark](https://gist.github.com/gonzedge/dec2c31335342dba02a77141fa8184fb). – Edgar Gonzalez Jan 24 '17 at 05:17