A classical example of where Bloom filters shine is in hyphenation algorithms. It's even the example given in the original paper on Bloom filters.
I don't understand how a Bloom filter would be used in a hyphenation algorithm.
A hyphenation algorithm is defined as something that takes an input word and gives back the possible ways that that word can be hyphenated.
Would the Bloom filter contain both hyph-enation
and hyphena-tion
, and client code would query the filter for h-yphenation
, hy-phenation
, hyp-henation
, ...?
Here's what the original paper says:
Analysis of Hyphenation Sample Application
[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.