My idea would be:
- You would have a large data set containing misspelled words and their corresponding correct versions. We are looking for P(correct|wrong).
- For each of these you would then calculate P(wrong|correct) (remember, we need that for Bayes), meaning the probability of a given misspelling occurring for a given correct word. For example: "cheese" might be misspelled as "sheese" or "shess", with the first occurring 75% of the time and the other only 25% of the time. So: P(sheese|cheese) = 0.75, P(shess|cheese) = 0.25.
- You also calculate the relative frequency of each correct word in the given dictionary, e.g. P(cheese) = 0.7, P(chess) = 0.3. These would be our priors P(correct).
- Now you get a wrong word as input and can use Bayes' theorem to calculate each probability:
P(correct|wrong) = P(wrong|correct) * P(correct) / P(wrong)
P(wrong) will be the same for all possible correct words, so we can just ignore it for now. What we are left with is:
P(correct|wrong) = P(wrong|correct) * P(correct)
(For the example, also assume P(sheese|chess) = 0.25, i.e. "chess" is misspelled as "sheese" 25% of the time.)
Now, given the word "sheese", we can calculate P(cheese|sheese) = 0.7 * 0.75 = 0.525 and P(chess|sheese) = 0.3 * 0.25 = 0.075, therefore classifying the word as "cheese". (Strictly speaking these are unnormalized scores rather than probabilities, since we dropped P(wrong), but that doesn't change which word wins.)
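The whole idea can be sketched in a few lines of Python. This is a minimal illustration, not a production spell checker: the training pairs and word counts below are made up to reproduce the exact numbers from the example (P(sheese|cheese) = 0.75, P(sheese|chess) = 0.25, P(cheese) = 0.7, P(chess) = 0.3), and the helper names (`likelihood`, `correct_word`) are my own.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (misspelling, correct word) pairs,
# chosen so that P(sheese|cheese) = 0.75, P(shess|cheese) = 0.25,
# and P(sheese|chess) = 0.25, matching the example in the text.
pairs = ([("sheese", "cheese")] * 3 + [("shess", "cheese")]
         + [("sheese", "chess")] + [("chass", "chess")] * 3)

# Word frequencies in the dictionary: 7 vs 3 gives the priors
# P(cheese) = 0.7 and P(chess) = 0.3 from the text.
word_counts = Counter({"cheese": 7, "chess": 3})
total_words = sum(word_counts.values())

# Count misspellings per correct word to estimate P(wrong|correct).
mistake_counts = defaultdict(Counter)
for wrong, correct in pairs:
    mistake_counts[correct][wrong] += 1

def likelihood(wrong, correct):
    """Estimate P(wrong|correct) from the observed pair counts."""
    n = sum(mistake_counts[correct].values())
    return mistake_counts[correct][wrong] / n if n else 0.0

def correct_word(wrong):
    """Score each candidate by P(wrong|correct) * P(correct).

    P(wrong) is omitted because it is the same for every candidate,
    so the scores are unnormalized but the argmax is unchanged.
    """
    scores = {c: likelihood(wrong, c) * word_counts[c] / total_words
              for c in word_counts}
    return max(scores, key=scores.get), scores

best, scores = correct_word("sheese")
print(best)              # cheese
print(scores["cheese"])  # ~0.525  (0.75 * 0.7)
print(scores["chess"])   # ~0.075  (0.25 * 0.3)
```

In a real system you would also need smoothing (a candidate never seen misspelled this way gets probability 0 here) and a way to generate candidate corrections instead of scoring the whole dictionary.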