I want to find every occurrence of the entities from dict_data in SeqString, including misspelled ones. For example:
dict_data = ['johnson', 'apple platform']
SeqString = 'Johnson buys a new phone which is based on Apppple Platform. Johnson very likes the Apple Platform.'
Expected results:
Match 1: Johnson <=> johnson, start_char: 0, end_char: 7, similarity_score
Match 2: Apppple Platform <=> apple platform, start_char: 43, end_char: 59, similarity_score
Match 3: Johnson <=> johnson, start_char: 61, end_char: 68, similarity_score
Match 4: Apple Platform <=> apple platform, start_char: 84, end_char: 98, similarity_score
In short, dict_data is very large, and I want to match its entities against SeqString fuzzily, keeping only the matches whose similarity score is above a threshold.
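To make the expected output concrete, here is a brute-force sketch of the behaviour I am after. The helper find_fuzzy_matches, the use of difflib.SequenceMatcher.ratio() as the similarity score, and the threshold of 0.8 are all assumptions made for illustration. It compares every entity with every window of words, so it is far too slow for a large dict_data.

```python
import re
from difflib import SequenceMatcher


def find_fuzzy_matches(entities, text, threshold=0.8):
    """Compare each entity with every window of words that has the same word count.

    Hypothetical helper: brute force, only meant to illustrate the desired output.
    """
    # Character span (start, end) of every word in the text.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    for entity in entities:
        n_words = len(entity.split())
        for i in range(len(tokens) - n_words + 1):
            start = tokens[i][0]
            end = tokens[i + n_words - 1][1]
            candidate = text[start:end]
            # Similarity in [0, 1]; entities are assumed to be lowercase already.
            score = SequenceMatcher(None, entity, candidate.lower()).ratio()
            if score >= threshold:
                matches.append((candidate, entity, start, end, round(score, 2)))
    # Report matches in the order they appear in the text.
    return sorted(matches, key=lambda m: m[2])


dict_data = ['johnson', 'apple platform']
SeqString = 'Johnson buys a new phone which is based on Apppple Platform. Johnson very likes the Apple Platform.'

for candidate, entity, start, end, score in find_fuzzy_matches(dict_data, SeqString):
    print(f'{candidate} <=> {entity}, start_char: {start}, end_char: {end}, score: {score}')
```

On the example above this finds the four spans listed in the expected results, but the cost grows with the number of entities times the number of words in the text, which is what I need to avoid.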
I tried:
- The spaCy library. However, its matching is exact, so it cannot handle Apppple Platform.
- The fuzzywuzzy library, which has the method SequenceMatcher.get_matching_blocks(). However, "The triples are monotonically increasing in i and in j", which means that it cannot match the second occurrence of johnson.
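A minimal sketch of the second problem, using difflib.SequenceMatcher from the standard library as a stand-in (I am assuming it behaves like the matcher I called through fuzzywuzzy):

```python
from difflib import SequenceMatcher

entity = 'johnson'
SeqString = 'Johnson buys a new phone which is based on Apppple Platform. Johnson very likes the Apple Platform.'

matcher = SequenceMatcher(None, entity, SeqString.lower())

# get_matching_blocks() always ends with a zero-length sentinel block; drop it.
blocks = [block for block in matcher.get_matching_blocks() if block.size > 0]

for block in blocks:
    # block.a is the offset in the entity, block.b the offset in the text.
    print(block.b, block.b + block.size, SeqString[block.b:block.b + block.size])
```

Only the occurrence at offset 0 is reported. Each character of the entity is aligned at most once, so the second Johnson at offset 61 never shows up.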
Any solution for my case?