
I have a list of medicine names (regular_list) and a list of new names (new_list). I want to check whether the names in new_list are already present in regular_list. The issue is that the names in new_list may contain typos, and I want those names to still be considered a match to the regular list. I know that using stringdist is one solution to the problem, but I need a machine learning algorithm.

Steven Beaupré
rohit
  • possible duplicate of [machine learning to overcome typo errors](http://stackoverflow.com/questions/18329826/machine-learning-to-overcome-typo-errors) – Ferdinand.kraft Sep 11 '13 at 01:42

1 Answer


As was already mentioned in machine learning to overcome typo errors, machine learning tools are overkill for such a task, but the simplest possibility would be to merge the two approaches.
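The non-ML half of that merge, plain edit distance against every dictionary word, can be sketched as follows. This is an illustrative Python sketch, not code from the question; the sample medicine names and the threshold of 2 are invented for demonstration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(word, dictionary, max_dist=2):
    """Return the closest dictionary word, or None if everything is too far."""
    cand = min(dictionary, key=lambda d: levenshtein(word, d))
    return cand if levenshtein(word, cand) <= max_dist else None

regular_list = ["paracetamol", "ibuprofen", "aspirin"]
print(best_match("ibuprofin", regular_list))  # one substitution away from "ibuprofen"
```

The weakness of a single global `max_dist` is exactly what the per-word classifier below addresses: one cutoff cannot fit both short names (where distance 2 is a huge change) and long names (where distance 2 is a minor typo).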

On one hand, you can compute the edit distance between a given word x and each of the dictionary words d_i. Additionally, you can train a per-word classifier

c(d_i, distance(x,d_i)) 

which returns True (class 1) if the given edit distance has been learned to be sufficient to consider x a misspelled version of d_i. This gives you a more general model than not using machine learning, as you can have a different threshold for each dictionary word (some words are misspelled more often than others), but obviously you have to prepare a training set in the form of (misspelled_word, correct_one) pairs (and also add (correct_one, correct_one) pairs).

You can use any type of binary classifier for this task that works on real-valued input data.
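A minimal sketch of the per-word idea: instead of one global cutoff, learn a separate acceptable edit distance for each dictionary word from (misspelled_word, correct_one) pairs. The training pairs below are invented for illustration, and the simple max-observed-distance rule stands in for a real binary classifier (e.g. logistic regression on the distance), which could be dropped in instead.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def train_thresholds(pairs):
    """pairs: (misspelled_word, correct_one) tuples, including
    (correct_one, correct_one) as suggested in the answer.
    Learns, per dictionary word, the largest distance seen in training."""
    thresholds = {}
    for misspelled, correct in pairs:
        d = levenshtein(misspelled, correct)
        thresholds[correct] = max(thresholds.get(correct, 0), d)
    return thresholds

def classify(x, d_i, thresholds):
    """c(d_i, distance(x, d_i)): True iff x counts as a misspelling of d_i."""
    return levenshtein(x, d_i) <= thresholds.get(d_i, 0)

train = [("aspirin", "aspirin"), ("asprin", "aspirin"),
         ("ibuprofen", "ibuprofen"), ("ibuprofin", "ibuprofen")]
th = train_thresholds(train)
print(classify("asprin", "aspirin", th))    # learned threshold accepts this typo
print(classify("warfarin", "aspirin", th))  # far beyond the learned threshold
```

Note how "aspirin" and "ibuprofen" can end up with different thresholds if their training typos differ, which is precisely the advantage over a single global cutoff.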

lejlot
  • continuing with the problem above: my database regular_list has around 150,000 words, whereas new_list has around 350,000 words. Calculating the distance between every pair would require 150,000 * 350,000 comparisons, which is working very, very slowly. Please, could I find a better way? – rohit Aug 26 '13 at 18:35
  • There are dozens of ways to speed things up. You can build various types of indexing that cut off the parts of the database for which the distance is too big to be considered (which can be done in constant time, for example by hashing 3-letter prefixes and 3-letter suffixes and looking only at those words whose prefix or suffix matches perfectly). For large-scale search you should consider using an existing search engine, e.g. Lucene http://lucene.apache.org/core/ – lejlot Aug 26 '13 at 18:46
  • Could any database be used to solve this? Supposing I put the two tables in as regular_list and new_list, could I then compare the above-mentioned distance through a query and let the database return the solution accordingly? – rohit Aug 30 '13 at 15:21
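The prefix/suffix hashing trick from the comment above can be sketched as follows. The medicine names are invented for illustration; the point is that only dictionary words sharing an exact 3-letter prefix or suffix with the query are ever compared, so the expensive distance computation runs on a tiny candidate set instead of all 150,000 words.

```python
from collections import defaultdict

def build_index(dictionary, k=3):
    """Hash each word under its k-letter prefix and k-letter suffix."""
    index = defaultdict(set)
    for word in dictionary:
        index[("pre", word[:k])].add(word)
        index[("suf", word[-k:])].add(word)
    return index

def candidates(word, index, k=3):
    """Only words whose k-letter prefix OR suffix matches the query exactly."""
    return index[("pre", word[:k])] | index[("suf", word[-k:])]

regular_list = ["paracetamol", "ibuprofen", "aspirin", "amoxicillin"]
idx = build_index(regular_list)
print(candidates("ibuprofin", idx))  # only "ibuprofen" survives the cutoff
```

The trade-off: a typo inside the first or last three letters can knock the true match out of the candidate set entirely, which is why the comment suggests a full search engine such as Lucene for data at this scale.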