I want to find every occurrence of the entities in dict_data within SeqString, including approximate (fuzzy) matches. For example:

dict_data = ['johnson', 'apple platform']
SeqString = 'Johnson buys a new phone which is based on Apppple Platform. Johnson very likes the Apple Platform.'

Expected results:

Match 1:Johnson <=> johnson, start_char:0, end_char:7, similarityscore

Match 2:Apppple Platform <=> apple platform, start_char:43, end_char:59, similarityscore

Match 3:Johnson <=> johnson, start_char:61, end_char:68, similarityscore

Match 4:Apple Platform <=> apple platform, start_char:84, end_char:98, similarityscore

In practice, dict_data is very large. I want to match its entities against the text fuzzily and keep only the matches whose similarity score exceeds a threshold.

I tried:

  1. The spaCy library. However, its matching is based on exact matches, so it cannot handle Apppple Platform.

  2. The fuzzywuzzy library, which uses the method SequenceMatcher.get_matching_blocks(). However, according to the documentation, "The triples are monotonically increasing in i and in j". This means it cannot match the second occurrence of johnson.
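
To illustrate the limitation in item 2, here is a minimal sketch using the standard-library difflib.SequenceMatcher, which is where the quoted sentence comes from. It assumes this is the matcher involved (fuzzywuzzy may use a different backend depending on what is installed); the variable names are only for illustration.

```python
from difflib import SequenceMatcher

seq_string = ('Johnson buys a new phone which is based on Apppple Platform. '
              'Johnson very likes the Apple Platform.')

# Compare the entity (sequence a) against the lowercased text (sequence b).
matcher = SequenceMatcher(None, 'johnson', seq_string.lower())
blocks = matcher.get_matching_blocks()

# Each block is (a_start, b_start, size). Because blocks must increase in
# both a and b, 'johnson' can be aligned only once: the occurrence at
# offset 0 is reported, followed by the zero-length terminating block.
# The second occurrence at offset 61 never appears.
for block in blocks:
    print(block)
```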

Any solution for my case?

futurelj

1 Answer

Depending on how much data you have available, you might consider using the exact matches to generate training data and then training a custom entity type in spaCy's NER (https://spacy.io/usage/training#section-ner). The NER should then be able to handle the fuzzy matches (and more). However, make sure you keep texts containing mentions that would only match fuzzily out of the training data; otherwise those mentions would be left unlabeled, and you'd be training the NER not to detect Apppple Platform.
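
As a rough sketch of the first step, exact (case-insensitive) matches can be turned into character-offset annotations with the standard library alone. The helper name build_training_examples and the label "ENTITY" are hypothetical, and the (text, {"entities": [...]}) layout is an assumption about what your spaCy version's training pipeline expects, so adapt it as needed.

```python
import re

def build_training_examples(texts, entities, label="ENTITY"):
    """Return (text, {"entities": [(start, end, label), ...]}) pairs
    for every text that contains at least one exact match."""
    # Longest entities first, so a longer phrase wins over a shorter one
    # that starts at the same position.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b",
        re.IGNORECASE,
    )
    examples = []
    for text in texts:
        spans = [(m.start(), m.end(), label) for m in pattern.finditer(text)]
        if spans:
            examples.append((text, {"entities": spans}))
    return examples

dict_data = ['johnson', 'apple platform']
# Assumption: texts with misspelled mentions (e.g. "Apppple Platform")
# have already been filtered out, as advised above.
texts = [
    'Johnson very likes the Apple Platform.',
    'No entities here.',
]
train_data = build_training_examples(texts, dict_data)
print(train_data)
```

Texts without any exact match are skipped here; whether to keep them as negative examples is a separate design choice.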