My idea would be:
- You would have a large data set containing misspelled words and their corresponding correct versions. We are looking for P(correct|wrong).
- For each of these you would then calculate P(wrong|correct) (remember, we need that for Bayes), meaning the probability of a given misspelling occurring for a given correct word. For example: "cheese" might be misspelled as "sheese" or "shess", with the first occurring 75% of the time and the other only 25% of the time. So: P(sheese|cheese) = 0.75, P(shess|cheese) = 0.25.
- You also calculate the relative frequency of each correct word in the given dictionary, e.g. P(cheese) = 0.7, P(chess) = 0.3. These would be our priors P(correct).
- Now you get a wrong word as input and can use Bayes' theorem to calculate each probability:
P(correct|wrong) = P(wrong|correct) * P(correct) / P(wrong)
P(wrong) will be the same for all possible correct words, so we can just ignore it for now. What we are left with is:
P(correct|wrong) = P(wrong|correct) * P(correct)
(For the example, also assume P(sheese|chess) = 0.25, i.e. "chess" is misspelled as "sheese" 25% of the time.)
Now, given the word "sheese", we can calculate P(cheese|sheese) = 0.7 * 0.75 = 0.525 and P(chess|sheese) = 0.3 * 0.25 = 0.075, therefore classifying the word as "cheese". (Strictly speaking these are unnormalized scores rather than probabilities, since we dropped P(wrong), but that doesn't change which word wins.)
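The whole idea can be sketched in a few lines of Python. This is a minimal illustration, not a production spell checker: the training pairs and word counts below are made up to reproduce the exact numbers from the example (P(sheese|cheese) = 0.75, P(sheese|chess) = 0.25, P(cheese) = 0.7, P(chess) = 0.3), and the helper names (`likelihood`, `correct_word`) are my own.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (misspelling, correct word) pairs,
# chosen so that P(sheese|cheese) = 0.75, P(shess|cheese) = 0.25,
# and P(sheese|chess) = 0.25, matching the example in the text.
pairs = ([("sheese", "cheese")] * 3 + [("shess", "cheese")]
         + [("sheese", "chess")] + [("chass", "chess")] * 3)

# Word frequencies in the dictionary: 7 vs 3 gives the priors
# P(cheese) = 0.7 and P(chess) = 0.3 from the text.
word_counts = Counter({"cheese": 7, "chess": 3})
total_words = sum(word_counts.values())

# Count misspellings per correct word to estimate P(wrong|correct).
mistake_counts = defaultdict(Counter)
for wrong, correct in pairs:
    mistake_counts[correct][wrong] += 1

def likelihood(wrong, correct):
    """Estimate P(wrong|correct) from the observed pair counts."""
    n = sum(mistake_counts[correct].values())
    return mistake_counts[correct][wrong] / n if n else 0.0

def correct_word(wrong):
    """Score each candidate by P(wrong|correct) * P(correct).

    P(wrong) is omitted because it is the same for every candidate,
    so the scores are unnormalized but the argmax is unchanged.
    """
    scores = {c: likelihood(wrong, c) * word_counts[c] / total_words
              for c in word_counts}
    return max(scores, key=scores.get), scores

best, scores = correct_word("sheese")
print(best)              # cheese
print(scores["cheese"])  # ~0.525  (0.75 * 0.7)
print(scores["chess"])   # ~0.075  (0.25 * 0.3)
```

In a real system you would also need smoothing (a candidate never seen misspelled this way gets probability 0 here) and a way to generate candidate corrections instead of scoring the whole dictionary.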