Hi,
I have the following task:
1) I have a list A of 700,000 train/bus station names.
2) I have a list B of 300,000 train/bus station names (spelled slightly differently, of course).
3) For, let's say, 150,000 elements of B I know the exact match in A.
4) I want to match the remaining elements of B with A (let's assume each one does have a match).
I know there are lots of similar questions here about this kind of fuzzy text/string matching, but what I find unsatisfying is that more or less all of them rely on algorithms like Levenshtein distance, and Levenshtein is problematic when the texts contain abbreviations. For example, "Gleis" = "Gl." (German for platform) or "Strasse" = "Str." (German for street) should not increase the distance score. The same goes for abbreviated city names, and so on.
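To make the problem concrete, here is a minimal sketch (plain textbook Levenshtein, not any particular library) showing how badly an abbreviation pair scores, even though a human would treat the two strings as identical:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "strasse" and its abbreviation "str." are 4 edits apart --
# the same distance as between two completely unrelated short words:
print(levenshtein("strasse", "str."))  # → 4
```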
There are more of these abbreviations than I can handle manually, so I thought I could exploit the training data from 3).
Does anybody have ideas / thoughts / projects / remarks on using AI / machine learning for this kind of task? The training data should be enough for an algorithm to learn most of the common abbreviations.
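One direction I could imagine, as a rough sketch only (the station names and the positional-alignment heuristic below are invented for illustration, not a finished method): tokenize each known matched pair, align tokens by position where the token counts agree, and count how often an unequal token pair co-occurs. Frequent pairs such as ("gleis", "gl.") then form a normalization dictionary that is applied before any distance computation:

```python
from collections import Counter

def mine_abbreviations(pairs, min_count=1):
    """Collect token substitutions from known matched name pairs.

    pairs: iterable of (name_in_A, name_in_B) known to refer to
    the same station. Only pairs with equal token counts are used,
    so tokens can be aligned by position (crude but cheap).
    """
    subs = Counter()
    for a, b in pairs:
        ta, tb = a.lower().split(), b.lower().split()
        if len(ta) != len(tb):
            continue  # skip pairs we cannot align positionally
        for x, y in zip(ta, tb):
            if x != y:
                subs[(x, y)] += 1
    # keep only substitutions seen at least min_count times
    return {k: v for k, v in subs.items() if v >= min_count}

def normalize(name, mapping):
    """Expand abbreviated tokens to their long form before matching."""
    expand = {short: long for (long, short) in mapping}
    return " ".join(expand.get(t, t) for t in name.lower().split())

# toy training data (invented names, stands in for the 150,000 known pairs)
training = [
    ("Hauptbahnhof Gleis 1", "Hauptbahnhof Gl. 1"),
    ("Berliner Strasse", "Berliner Str."),
    ("Gleis 2 Nord", "Gl. 2 Nord"),
]
mapping = mine_abbreviations(training)
print(normalize("Gl. 5", mapping))  # → "gleis 5"
```

After normalizing both lists this way, a standard distance function would no longer be penalized by the learned abbreviations; the open question is how to make the mining step robust to reordered or merged tokens.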
Also, I have seen some AI approaches to this, but they only use machine learning to find a suitable threshold for the distance function to distinguish between match and no-match, which does not help with the abbreviations.
Thanks, Tim