It looks like you are trying to use tools like fuzzywuzzy that were not really designed for this task.
One possible approach to this problem is to find how many tokens from the second text are present in the first text. This can be normalized by the total number of tokens in the second text. You can then threshold at whatever value you deem fit.
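As a minimal sketch of the idea (using a naive whitespace split and ignoring duplicate tokens, with a hypothetical `naive_containment()` helper), this could look like:

# a minimal sketch, ignoring duplicate tokens: the fraction of the
# (unique) tokens of `b` that are also present in `a`
def naive_containment(a, b):
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_b)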
A more complete way of implementing this is the following:
- Tokenize (i.e. convert to a list of tokens) the input texts `a` and `b`.
- Collect each token list into a corresponding counter (i.e. a data structure for counting how many times each token appears).
- Compute the intersection `a_i_b` of the tokens for `a` and `b` (the counter intersection semantics are illustrated in the snippet after this list).
- Compute some metric based on the total occurrences of `a_i_b` (`weight_a_i_b`) and the total occurrences of `b` (`weight_b`). This final metric is a proxy of the "amount" of `b` contained in `a`. It could be a ratio or a difference, and should use the fact that `weight_a_i_b <= weight_b` by construction.
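For reference, the `&` operator on two `collections.Counter` objects keeps each common token with the minimum of its two counts, which is exactly the multiset intersection needed here:

import collections

# each common token is kept with the minimum of its two counts
print(collections.Counter('aab') & collections.Counter('abb'))
# Counter({'a': 1, 'b': 1})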
The difference `weight_b - weight_a_i_b` results in a number between 0 and the number of tokens in `b`, which is also a direct measure of how many tokens from `b` are not found in `a`; hence 0 indicates perfect matching.
The ratio `weight_a_i_b / weight_b` results in a number between 0 and 1, with 1 meaning perfect matching and 0 meaning no matching.
The difference metric is probably more suited for small numbers of tokens, and it is easier to interpret and threshold in a meaningful way (e.g. accepting a value below 2 means that at most one token from `b` is not present in `a`).
On the other hand, the ratio is more standard and probably more suited for larger token lists.
All this would translate into the following code, leveraging `collections.Counter()` for counting the tokens:
import collections


def contains_tokens(
        text_a,
        text_b,
        tokenize_kws=None,
        metric=lambda a, b, a_i_b: b - a_i_b):
    """Compute a metric of how much of `text_b` is contained in `text_a`."""
    tokenize_kws = dict(tokenize_kws) if tokenize_kws is not None else {}
    # count the occurrences of each token in both texts
    counter_a = collections.Counter(tokenize(text_a, **tokenize_kws))
    counter_b = collections.Counter(tokenize(text_b, **tokenize_kws))
    # multiset intersection: common tokens with the minimum of the counts
    counter_a_i_b = counter_a & counter_b
    # the metric receives (weight_a, weight_b, weight_a_i_b)
    weight_a = counter_total(counter_a)
    weight_b = counter_total(counter_b)
    weight_a_i_b = counter_total(counter_a_i_b)
    return metric(weight_a, weight_b, weight_a_i_b)
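Since the metric is passed in as a parameter, the ratio variant discussed above can be obtained without touching the function body, e.g.:

# ratio variant: 1 means perfect matching, 0 means no matching
print(contains_tokens(
    'Test - 4567: Controlling_robotic_hand_with_Arduino_uno',
    'Controlling robotic hand',
    metric=lambda a, b, a_i_b: a_i_b / b))
# 1.0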
The first step, i.e. tokenization, is achieved with the following function. This is a bit primitive, but it gets the job done for your input. Essentially, it replaces a number of special characters (`ignores`) with blanks, and then splits the string along the blanks, optionally excluding the tokens in a blacklist (`excludes`).
def tokenize(
        text,
        case_sensitive=False,
        ignores=('_', '-', ':', ',', '.', '?', '!'),
        excludes=('the', 'from', 'to')):
    """Tokenize a text, ignoring some characters and excluding some tokens."""
    if not case_sensitive:
        text = text.lower()
    # turn the ignored characters into separators
    for ignore in ignores:
        text = text.replace(ignore, ' ')
    # split on whitespace and skip the blacklisted tokens
    for token in text.split():
        if token not in excludes:
            yield token
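For instance, on the first input this yields:

print(list(tokenize('Test - 4567: Controlling_robotic_hand_with_Arduino_uno')))
# ['test', '4567', 'controlling', 'robotic', 'hand', 'with', 'arduino', 'uno']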
To count the total number of values in a counter, the following function is used. Note that for Python 3.10 and later, there is a built-in method `Counter.total()` which does exactly the same.
def counter_total(counter):
    """Count the total number of values."""
    return sum(counter.values())
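On Python 3.10+ the equivalence can be checked directly:

c = collections.Counter('aab')
print(counter_total(c), c.total())
# 3 3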
For the given input this becomes:
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
and
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos2'
# one token from `b` (`pos2`) not in `a`
print(contains_tokens(a, b))
# 1
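Finally, applying the threshold is straightforward; for example, with the difference metric and a hypothetical `matches()` helper accepting at most one missing token (i.e. a value below 2, as discussed above):

# hypothetical helper: accept if at most `max_missing` tokens of `b` are missing from `a`
def matches(a, b, max_missing=1):
    return contains_tokens(a, b) <= max_missing

print(matches(
    'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1',
    'Controlling from pos0 to pos2'))
# True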
Note that distance-based functions (like `fuzz.token_set_ratio()` or `fuzz.partial_ratio()`) cannot be used in this context because they are sensitive to how much "noise" is present in the first text. For example, if `b = 'a b c'`, those tokens are contained equally in `a = 'a b c'` and in `a = 'a b c d e f g h i'`, and a distance cannot account for that, most notably because distance functions are symmetric (i.e. `f(a, b) == f(b, a)`) while the function you are looking for is not (i.e. `f(a, b) != f(b, a)`).
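For completeness, the asymmetry of the proposed approach can be checked directly:

a = 'a b c d e f g h i'
b = 'a b c'

# all tokens of `b` are in `a`...
print(contains_tokens(a, b))
# 0

# ...but 6 of the 9 tokens of `a` are not in `b`
print(contains_tokens(b, a))
# 6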