I have a table with company names. There are many duplicates because of human input errors: typos, disagreement over whether the subdivision should be included, and so on. I want all of these duplicates to be marked as one company, "1c":
+------------------+
| company          |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+
I identified Levenshtein distance as a good way to catch typos. However, when a subdivision is added to the name, the Levenshtein distance increases dramatically, so it no longer seems like a good fit for this problem. Is this correct?
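To illustrate what I mean, here is a minimal Java sketch of the plain dynamic-programming Levenshtein distance. The class name `LevenshteinDemo` and the misspelling "1c compay" are made up for this example and are not in my data. A typo costs a single edit, while an added subdivision costs about as many edits as it has characters:

```java
public class LevenshteinDemo {

    // Plain Levenshtein distance: insertions, deletions and substitutions each cost 1.
    // Only two rows of the DP table are kept in memory.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            // Swap the rows so that prev holds the row just computed.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Hypothetical typo: one missing letter.
        System.out.println(distance("1c company", "1c compay"));   // prints 1
        // Subdivision appended: every extra character counts as an edit.
        System.out.println(distance("1c", "1c: maddox games"));    // prints 14
    }
}
```

So any threshold loose enough to merge "1c" with "1c: maddox games" would presumably also merge many unrelated short names, which is why I doubt that the raw distance alone is enough.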
In general, I have barely any experience in computational linguistics, so I am at a loss as to which methods I should choose.
What algorithms would you recommend for this problem? I want to implement it in Java. Pure SQL would also be okay. Links to sources would be appreciated. Thanks.