Hyphenation algorithm using Bloom filter

Question

A classical example of where Bloom filters shine is in hyphenation algorithms. It's even the example given in the original paper on Bloom filters.

I don't understand how a Bloom filter would be used in a hyphenation algorithm.

A hyphenation algorithm is defined as something that takes an input word and gives back the possible ways that that word can be hyphenated.

Would the Bloom filter contain both hyph-enation and hyphena-tion, and client code would query the filter for h-yphenation, hy-phenation, hyp-henation, ...?

Here's what the original paper says:

Analysis of Hyphenation Sample Application

[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.

Thomas Mueller · Accepted Answer · 2019-11-15T15:19:37.320

For this case,

the dictionary is stored on disk and contains all words with the correct hyphenation,
the Bloom filter contains just the words that require special hyphenation, e.g. perhaps the word "hyphenation" itself,
the Bloom filter responds with "probably" or "no".

Then the algorithm to find the possible hyphenations of a word is:

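String word = "hyphenation"; // or some other word
String result;
// true means "probably", false means "no"
boolean probably = bloomFilter.probablyContains(word);
if (probably) {
    // Possibly a special case: one disk read to fetch the stored hyphenation.
    result = lookupInDictionary(word).getHyphenation();
} else {
    // Definitely not a special case: no disk access needed.
    result = useSimpleRuleBasedHyphenation(word);
}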

If the Bloom filter responds with "probably", then the algorithm would have to do a disk read in the dictionary.

The Bloom filter will sometimes respond with "probably" even though the word has no special rules (a false positive), in which case a disk read is done unnecessarily. That's OK as long as it doesn't happen too often, i.e. the false positive rate is low, e.g. 1/16.
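To make the 1/16 figure concrete, here is a quick check of the arithmetic from the quoted passage. The class and method names are made up for illustration; the numbers (500,000 words, 50,000 of them needing the dictionary, P = 1/16) come from the paper excerpt in the question.

```java
// Back-of-the-envelope check of the numbers in the quoted passage.
// DiskAccessEstimate and expectedLookups are illustrative names only.
class DiskAccessEstimate {
    // Every special word triggers a lookup (no false negatives),
    // plus the false positives among the rule-based words.
    static double expectedLookups(int specialWords, int ruleWords, double falsePositiveRate) {
        return specialWords + ruleWords * falsePositiveRate;
    }

    public static void main(String[] args) {
        int total = 500_000;
        int special = 50_000;              // words that require the dictionary
        int ruleBased = total - special;   // 450,000 words handled by simple rules
        double p = 1.0 / 16;               // allowed false positive rate

        double lookups = expectedLookups(special, ruleBased, p);
        double fraction = lookups / total;
        System.out.printf("lookups = %.0f, fraction = %.1f%%%n", lookups, fraction * 100);
    }
}
```

That works out to 50,000 + 28,125 = 78,125 lookups, or about 15.6 percent of the words, which matches the paper's "approximately 78,000" and "approximately 16 percent".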

The Bloom filter, as it doesn't have false negatives, would never respond with "no" for words that do have special hyphenation.
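If it helps to see the whole thing end to end, below is a minimal, self-contained sketch of the idea. Everything in it is an assumption made for illustration: the tiny `BloomFilter` class (a `BitSet` with double hashing), the in-memory `HashMap` standing in for the disk-resident dictionary, the `diskReads` counter, the sample entry, and the placeholder "split in the middle" rule, which is not a real hyphenation rule. One deliberate difference from the setup described above: the stand-in dictionary holds only the special words, so on a false positive the lookup finds nothing and the code falls back to the rule.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class HyphenationSketch {
    /** Tiny Bloom filter: hashCount bit positions per key, derived by double hashing. */
    static class BloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        BloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // second hash, forced odd
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < hashCount; i++) bits.set(index(key, i));
        }

        boolean probablyContains(String key) {
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(index(key, i))) return false; // "no": key was never added
            }
            return true; // "probably": all bits set, possibly by other keys
        }
    }

    // In-memory stand-in for the disk-resident dictionary (special words only).
    static final Map<String, String> dictionary = new HashMap<>();
    static final BloomFilter filter = new BloomFilter(1024, 3);
    static int diskReads = 0; // counts simulated disk accesses

    static void addException(String word, String hyphenated) {
        dictionary.put(word, hyphenated);
        filter.add(word);
    }

    static String hyphenate(String word) {
        if (filter.probablyContains(word)) {
            diskReads++; // the expensive step the filter is guarding
            String special = dictionary.get(word);
            if (special != null) return special;
            // False positive: nothing stored for this word, use the rule instead.
        }
        return simpleRuleBasedHyphenation(word);
    }

    // Placeholder rule for the demo only: split the word in the middle.
    static String simpleRuleBasedHyphenation(String word) {
        if (word.length() < 4) return word;
        int mid = word.length() / 2;
        return word.substring(0, mid) + "-" + word.substring(mid);
    }

    public static void main(String[] args) {
        addException("hyphenation", "hy-phen-a-tion"); // illustrative entry
        System.out.println(hyphenate("hyphenation"));
        System.out.println(hyphenate("filter"));
        System.out.println("disk reads: " + diskReads);
    }
}
```

The filter never looks at hyphenation points at all. It only answers "could this word be one of the special ones?", and every word passed to `addException` is guaranteed to get a "probably", so no special word is ever routed to the simple rule by mistake.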

Ok, so if I understand this correctly, the Bloom filter itself has basically nothing to do with the actual hyphenation. It's simply used as a guard against performing an expensive operation unnecessarily. — aioobe, Nov 15 '19 at 15:39
