
I am trying to write Python code that takes a word as input (e.g. book) and outputs the most similar word along with a similarity score.

I have tried different off-the-shelf measures such as cosine similarity, Levenshtein edit distance and others, but these cannot tell the degree of difference. For example, (book, bouk) and (book, bo0k). I am looking for an algorithm that can give different scores for these two examples. I am thinking about using fastText or BPE, but they use cosine distance.

Is there any algorithm that can solve this?

Niddal Imam

3 Answers


The problem is that both "bo0k" and "bouk" are one character different from "book", and edit distance on its own gives you no way to distinguish between them.

What you will need to do is change the scoring: instead of counting a different character as an edit distance of 1, you could give it a higher cost if it belongs to a different character class (i.e. a digit instead of a letter). That way you will get different scores for your two examples.

You might have to adapt the other scores as well, though, so that replacement / insertion / deletion are still consistent.
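
Here is a minimal sketch of that idea, assuming a simple two-tier substitution cost (the exact cost values are arbitrary and would need tuning):

def char_class(c):
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return "other"

def weighted_levenshtein(a, b, same_class_cost=1.0, cross_class_cost=2.0):
    # Standard dynamic-programming edit distance, except that substituting a
    # character from a different class (e.g. a digit for a letter) costs more.
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)
    for j in range(1, n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif char_class(a[i - 1]) == char_class(b[j - 1]):
                sub = same_class_cost
            else:
                sub = cross_class_cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

print(weighted_levenshtein("book", "bouk"))  # 1.0 - letter replaced by letter
print(weighted_levenshtein("book", "bo0k"))  # 2.0 - letter replaced by digit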

Oliver Mason

That is a very interesting question - probably with many possible answers. You could add bigram (n-gram) analysis to rank how likely the letter sequences would be in typical words of the language.

Presume your system doesn't "know" the target word, but someone types "bouk". It then analyses all the bigrams:

bo, ou, uk

or trigrams

bou, ouk

I would guess here that "bo", "ou" and "bou" would score well as they are common, but "uk" and "ouk" would not be likely in English. So this could simply have a 3/5 score, but actually each n-gram would have its own frequency score (probability), so the overall number for the proposed word could be quite refined.

Then comparing that to "bo0k" you'd look at all bigrams:

bo, o0, 0k

or trigrams

bo0, o0k

Now you can see that only "bo" would score well here. All the others would not be found in a common n-gram corpus. So this word would score much lower than "bouk" for likelihood, e.g. 1/5 compared to the 3/5 for "bouk".

There would be roughly three parts to the solution:

You would need a corpus of established n-gram frequencies for the language. For example this random blog I found discusses that: https://blogs.sas.com/content/iml/2014/09/26/bigrams.html

Then you would need to process (tokenise and scan) your input words into n-grams and then look up their frequencies in the corpus. You could use something like SK Learn.

Then you can sum the parts in whatever way you like to establish the overall score for the word.
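
Here is a minimal sketch of those three parts in Python, using a tiny stand-in corpus (in practice you would count bigrams over a large word list or take the figures from a published frequency table):

import math
from collections import Counter

def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# 1. Build a bigram-frequency table. This corpus is a tiny stand-in;
#    a real table would come from a large word list or published counts.
CORPUS = ["book", "boot", "look", "out", "about", "you", "could", "would"]
counts = Counter(g for w in CORPUS for g in char_ngrams(w))
total = sum(counts.values())

FLOOR = 1e-6  # smoothing, so one unseen bigram doesn't zero the whole score

# 2. + 3. Tokenise the input word into bigrams, look up their frequencies,
#         and combine them into an overall score (average log-probability).
def plausibility(word):
    grams = char_ngrams(word)
    return sum(math.log(counts[g] / total + FLOOR) for g in grams) / len(grams)

print(plausibility("bouk"))  # higher (less negative): 'bo' and 'ou' are common
print(plausibility("bo0k"))  # much lower: 'o0' and '0k' never occur in the corpus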

Note you may find that most tokenisers and n-gram processing for natural language centre around word relations, not letters within words. It's easy to get lost on that, as the fact that a library is focused on word-grams is often not mentioned explicitly because it's the most common case. I've noticed that before, but n-grams are used in all sorts of other data sets too (time series, music, any sequence really). This question discusses how you can convert SK Learn's vectoriser to do letter-grams, but I've not tried this myself: N-grams for letter in sklearn
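
For reference, a minimal sketch of character-level n-grams with scikit-learn's CountVectorizer (the analyzer="char" option is what that linked question is about):

from sklearn.feature_extraction.text import CountVectorizer

# Character bigrams and trigrams instead of the default word n-grams.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vectorizer.fit_transform(["book", "bouk", "bo0k"])

# The learned features are letter-grams such as 'bo', 'oo', 'ou', 'o0', 'bou', 'o0k', ...
print(vectorizer.get_feature_names_out())
print(X.toarray())  # one row per word, one column per letter-gram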

scipilot
  • Thank you @scipilot for your suggestions and generous explanations. I liked both ideas, but the first one is more suitable for my task. – Niddal Imam Apr 25 '20 at 18:17
  • You are welcome. BTW the best way to thank people on Stack Overflow is to upvote their answers if they are relevant and well written, even if not accepted. Some questions will have multiple valid answers, so it's important to credit them all with upvotes. – scipilot Apr 26 '20 at 03:10

I have a second idea which uses "domain knowledge" - in this case, that someone is typing on a keyboard. It doesn't directly answer your question, but it illustrates that there might be entirely different approaches to achieving the end goal (which you haven't directly described - i.e. a user interface presenting spell-checker options?).

I once wrote an algorithm at uni which used a keyboard layout map (as one strategy in a spell checker): it iterated over all the surrounding keys to propose "fat-fingering" corrections when a word was not found in the dictionary.

So for example O is surrounded by I90PLK, I is ringed by U89OK or perhaps U89OKJ.

Therefore you can mutate each input word by replacing each letter, in turn, with each of its surrounding neighbours. You will end up with lots of combinations, but most of them will be completely bogus words. One of them could be a perfect match to a dictionary word.

So all you need to do is generate all the possible typo neighbours and simply look for dictionary words among the mutants, which should be an efficient query (a short code sketch follows the example below).

e.g. for bo0k

bo0k
vo0k
go0k
ho0k
no0k
_o0k

bi0k
b90k
b00k
bp0k
bl0k
bk0k

bo9k
bo0k
bo-k
bopk
book       - bingo!
boik

bo0j
bo0u
bo0i
bo0o
bo0l
bo0,
bo0m

You can see here that there is only one dictionary word in the entire set of basic typo mutants.
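
A minimal sketch of this mutate-and-look-up idea, assuming a hand-written (and deliberately incomplete) neighbour map and a tiny stand-in dictionary:

# Illustrative, partial QWERTY neighbour map; a real one would cover the whole layout.
KEY_NEIGHBOURS = {
    "o": "i90plk",
    "0": "9o-p",
    "b": "vghn",
    "k": "jiol,m",
}

DICTIONARY = {"book", "look", "boot", "bonk"}  # stand-in for a real word list

def typo_candidates(word):
    # Replace each character, in turn, with each of its keyboard neighbours.
    for i, ch in enumerate(word):
        for repl in KEY_NEIGHBOURS.get(ch.lower(), ""):
            yield word[:i] + repl + word[i + 1:]

def corrections(word):
    return sorted(set(typo_candidates(word)) & DICTIONARY)

print(corrections("bo0k"))  # ['book'] - bingo!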

So this doesn't use any similarity algorithms but in the case of keyboard typos, it can find corrections. You could even record user "acceptance" of these proposals and form your own corpus of correction probabilities. I'm guessing many typos are pretty common and consistent.

Obviously this doesn't cover spelling errors, although a similar domain knowledge approach could be taken there, per natural language with its specific quirks and difficulties.

scipilot