Hi,
I have the following task:
1) I have a list A of 700,000 train/bus station names.
2) I have a list B of 300,000 train/bus station names (spelled slightly differently, of course).
3) For, let's say, 150,000 elements of B I know the exact match in A.
4) I want to match the remaining elements of B with A (let's assume each one does have a match).
I know there are lots of similar questions here about this kind of fuzzy text/string matching, but what I find unsatisfying is that more or less all of them rely on algorithms like Levenshtein distance, and Levenshtein is problematic when the texts contain abbreviations. For example, "Gleis" = "Gl." (German for platform) or "Strasse" = "Str." (German for street) should not increase the distance score. The same goes for abbreviated city names, and so on.
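To make the problem concrete, here is a minimal sketch (plain textbook Levenshtein, not any particular library) showing how badly an abbreviation pair scores, even though a human would treat the two strings as identical:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "strasse" and its abbreviation "str." are 4 edits apart --
# the same distance as between two completely unrelated short words:
print(levenshtein("strasse", "str."))  # → 4
```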
There are more of these abbreviations than I can handle manually, so I thought I could exploit the training data from 3).
Does anybody have ideas / thoughts / projects / remarks on using AI / machine learning for this kind of task? The training data should be enough for an algorithm to learn most of the common abbreviations.
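One direction I could imagine, as a rough sketch only (the station names and the positional-alignment heuristic below are invented for illustration, not a finished method): tokenize each known matched pair, align tokens by position where the token counts agree, and count how often an unequal token pair co-occurs. Frequent pairs such as ("gleis", "gl.") then form a normalization dictionary that is applied before any distance computation:

```python
from collections import Counter

def mine_abbreviations(pairs, min_count=1):
    """Collect token substitutions from known matched name pairs.

    pairs: iterable of (name_in_A, name_in_B) known to refer to
    the same station. Only pairs with equal token counts are used,
    so tokens can be aligned by position (crude but cheap).
    """
    subs = Counter()
    for a, b in pairs:
        ta, tb = a.lower().split(), b.lower().split()
        if len(ta) != len(tb):
            continue  # skip pairs we cannot align positionally
        for x, y in zip(ta, tb):
            if x != y:
                subs[(x, y)] += 1
    # keep only substitutions seen at least min_count times
    return {k: v for k, v in subs.items() if v >= min_count}

def normalize(name, mapping):
    """Expand abbreviated tokens to their long form before matching."""
    expand = {short: long for (long, short) in mapping}
    return " ".join(expand.get(t, t) for t in name.lower().split())

# toy training data (invented names, stands in for the 150,000 known pairs)
training = [
    ("Hauptbahnhof Gleis 1", "Hauptbahnhof Gl. 1"),
    ("Berliner Strasse", "Berliner Str."),
    ("Gleis 2 Nord", "Gl. 2 Nord"),
]
mapping = mine_abbreviations(training)
print(normalize("Gl. 5", mapping))  # → "gleis 5"
```

After normalizing both lists this way, a standard distance function would no longer be penalized by the learned abbreviations; the open question is how to make the mining step robust to reordered or merged tokens.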
Also, I have seen some AI approaches to this, but they only use machine learning to find a suitable threshold for the distance function to distinguish between match and no-match, which does not help with the abbreviations.
Thanks, Tim