
A classical example of where Bloom filters shine is in hyphenation algorithms. It's even the example given in the original paper on Bloom filters.

I don't understand how a Bloom filter would be used in a hyphenation algorithm.

A hyphenation algorithm takes an input word and returns the possible ways that word can be hyphenated.

Would the Bloom filter contain both hyph-enation and hyphena-tion, and client code would query the filter for h-yphenation, hy-phenation, hyp-henation, ...?

Here's what the original paper says:

Analysis of Hyphenation Sample Application

[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.

aioobe

1 Answer


For this case,

  • the dictionary is stored on disk and contains all words with the correct hyphenation,
  • the Bloom filter contains just keys that require special hyphenation, e.g. maybe hyphenation itself,
  • the Bloom filter responds with "probably" or "no".

Then the algorithm to find the possible hyphenations of a word is:

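String word = "hyphenation"; // or some other word
String hyphenation;
if (bloomFilter.probablyContains(word)) {
    // "probably": the word may need special hyphenation, so read the dictionary on disk
    hyphenation = lookupInDictionary(word).getHyphenation();
} else {
    // "no": the word is definitely not in the filter, so the simple rules are enough
    hyphenation = useSimpleRuleBasedHyphenation(word);
}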

If the Bloom filter responds with "probably", the algorithm has to read the dictionary from disk.

The Bloom filter will sometimes respond with "probably" for a word that in fact needs no special hyphenation (a false positive), in which case a disk I/O is done unnecessarily. That's OK as long as it doesn't happen too often, i.e. as long as the false positive rate is low (e.g. 1/16).
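To put numbers on this, here is a small sketch that redoes the arithmetic from the paper excerpt quoted in the question (50,000 dictionary words, 450,000 rule-based words, P = 1/16). The class and method names are made up for illustration; only the figures come from the paper.

```java
class DiskAccessEstimate {
    // Expected number of disk lookups: every word that really needs the dictionary,
    // plus the false positives among the words that could have used the simple rules.
    static double expectedDiskAccesses(int dictionaryWords, int ruleWords, double falsePositiveRate) {
        return dictionaryWords + ruleWords * falsePositiveRate;
    }

    public static void main(String[] args) {
        int dictionaryWords = 50_000;   // words that require the dictionary (from the paper)
        int ruleWords = 450_000;        // words handled by simple rules (from the paper)
        double p = 1.0 / 16;            // allowed false positive rate (from the paper)

        double accesses = expectedDiskAccesses(dictionaryWords, ruleWords, p);
        double fraction = accesses / (dictionaryWords + ruleWords);
        System.out.printf("disk accesses: %.0f (%.1f%% of all words)%n", accesses, 100 * fraction);
    }
}
```

This gives 50,000 + 28,125 = 78,125 disk accesses, i.e. 15.625 percent of the 500,000 words, which matches the paper's "approximately 16 percent".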

The Bloom filter, as it doesn't have false negatives, will never respond with "no" for words that do have special hyphenation.
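To see why there are no false negatives, here is a minimal Bloom filter sketch. It is not the construction from the paper: the class name, the filter size, the number of hashes and the way the second hash is derived from `String.hashCode()` are all assumptions chosen to keep the example short.

```java
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    // The second hash is a hypothetical choice; it is forced odd so it is never zero.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "no" (definitely never added);
    // true means "probably" (added, or a false positive).
    boolean probablyContains(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 16, 4);
        String[] special = {"hyphenation", "dictionary", "algorithm"};
        for (String w : special) {
            filter.add(w);
        }
        for (String w : special) {
            System.out.println(w + ": " + filter.probablyContains(w));
        }
    }
}
```

Adding a word sets all of its bits, and bits are never cleared, so a later query for the same word finds every one of those bits set and answers "probably". A "no" can only come from a bit that was never set, which means the word was never added.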

Thomas Mueller
  • Ok, so if I understand this correctly, the Bloom filter itself has basically nothing to do with the actual hyphenation. It's simply used as a guard against performing an expensive operation unnecessarily. – aioobe Nov 15 '19 at 15:39
  • Yes, this is how I understand this paragraph. – Thomas Mueller Nov 17 '19 at 12:57