
Peter Norvig's famous spellchecker (Java 8 version here) can correct single words, provided something close to that word appears in the training data. But how can I adapt it to handle entire phrases? For example, suppose I have a file where each phrase is on its own line:

Plastic box
Pencils and sketch
Romeo and Juliet
.
.
.

If I tell the algorithm to correct 'Platic', it should return 'Plastic box'. Similarly, if I tell it to correct 'Pencils', it should return 'Pencils and sketch'.

I tried changing the following lines of the above code (Java version):

Stream.of(new String(Files.readAllBytes( dictionaryFile )).toLowerCase().replaceAll("[^a-z ]","").split(" ")).forEach( (word) ->{
            dict.compute( word, (k,v) -> v == null ? 1 : v + 1  );
        });

to

 Stream.of(new String(Files.readAllBytes( dictionaryFile )).toLowerCase().split("\n")).forEach( (word) ->{
            dict.compute( word, (k,v) -> v == null ? 1 : v + 1  );
        });

but it didn't seem to work.

Daud

1 Answer


If you go through Norvig's spellchecker carefully, you will find that its error model only considers candidates at edit distance 1 and 2 from the misspelled word. So, if you wanted to correct Platic using the file big.txt as the dictionary, it could find the word Elastic, which is at edit distance 2, as a candidate correction.
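For reference, the edit-distance-1 candidate generation at the heart of this error model can be sketched as follows (a minimal, self-contained version of the idea; class and method names are mine, and Norvig's actual code differs in details):

```java
import java.util.HashSet;
import java.util.Set;

public class Edits1Demo {
    // All strings at edit distance 1 from `word`: deletions,
    // transpositions, replacements, and insertions over a-z.
    static Set<String> edits1(String word) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i), right = word.substring(i);
            if (!right.isEmpty())
                out.add(left + right.substring(1));                  // delete one char
            if (right.length() > 1)
                out.add(left + right.charAt(1) + right.charAt(0)
                        + right.substring(2));                       // transpose two chars
            for (char c = 'a'; c <= 'z'; c++) {
                if (!right.isEmpty())
                    out.add(left + c + right.substring(1));          // replace one char
                out.add(left + c + right);                           // insert one char
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "plastic" is one insertion away from "platic", so it is a candidate.
        System.out.println(edits1("platic").contains("plastic")); // true
    }
}
```

Applying `edits1` to every result of `edits1` gives the edit-distance-2 candidates; beyond that the candidate set grows very quickly, which is the problem discussed below.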

Now, with your modified code, the phrase Plastic box is not within edit distance 2 of the word Platic, so it is never even considered as a candidate by the error model. That is why it does not work.

In fact, the edit distance between them is 5, so you would have to implement edits3, edits4, and edits5 to make it work, which would generate millions of candidate strings and be quite inefficient.
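You can verify that distance with a standard Levenshtein implementation (an illustration only, not part of Norvig's code):

```java
public class EditDistanceDemo {
    // Standard Levenshtein distance via dynamic programming:
    // d[i][j] = edits to turn the first i chars of a into the first j chars of b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // delete
                                            d[i][j - 1] + 1),  // insert
                                   d[i - 1][j - 1] + sub);     // substitute
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("platic", "plastic"));     // 1
        System.out.println(distance("platic", "plastic box")); // 5
    }
}
```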

Instead, I think you can consider a bigram language model: rather than returning a single candidate word for a misspelled word, you can return the most likely bigram phrase, depending on its probability of occurrence in the dictionary, with the language model P(Plastic box) = P(Plastic) * P(box|Plastic). The candidate phrase's probability is then proportional to P(Plastic box) * P(Platic|Plastic box) by Bayes' formula, if you have an error model in place (or you have data to learn one).
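A minimal sketch of such a bigram model, trained on a phrase-per-line file like yours (the class and method names here are my own, purely illustrative):

```java
import java.util.*;

public class BigramModel {
    final Map<String, Integer> unigrams = new HashMap<>();
    final Map<String, Map<String, Integer>> bigrams = new HashMap<>();

    // Train on one phrase per line, e.g. "Plastic box", "Pencils and sketch".
    void train(List<String> phrases) {
        for (String phrase : phrases) {
            String[] words = phrase.toLowerCase().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                unigrams.merge(words[i], 1, Integer::sum);
                if (i + 1 < words.length)
                    bigrams.computeIfAbsent(words[i], k -> new HashMap<>())
                           .merge(words[i + 1], 1, Integer::sum);
            }
        }
    }

    // Most frequent word observed to follow `word`; empty if none seen.
    // Dividing the count by unigrams.get(word) would give P(next|word).
    Optional<String> mostLikelyNext(String word) {
        Map<String, Integer> next = bigrams.get(word.toLowerCase());
        if (next == null) return Optional.empty();
        return next.entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        BigramModel m = new BigramModel();
        m.train(List.of("Plastic box", "Pencils and sketch", "Romeo and Juliet"));
        System.out.println(m.mostLikelyNext("plastic").orElse("?")); // box
        System.out.println(m.mostLikelyNext("pencils").orElse("?")); // and
    }
}
```

You would first correct Platic to a single-word candidate with the existing spellchecker, then extend it to a phrase with `mostLikelyNext`; scoring full phrases would use the product of these conditional probabilities.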

Sandipan Dey