It looks like you are trying to use tools like fuzzywuzzy that were not really designed for this task.
One possible approach to this problem is to find how many tokens from the second text are present in the first text. This can be normalized by the total number of tokens in the second text. You can then threshold at whatever value you deem fit.
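As a minimal sketch of the idea (using a naive whitespace split and ignoring duplicate tokens, with a hypothetical `naive_containment()` helper), this could look like:

# a minimal sketch, ignoring duplicate tokens: the fraction of the
# (unique) tokens of `b` that are also present in `a`
def naive_containment(a, b):
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_b)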
A more complete way of implementing this is the following:
- Tokenize (i.e. convert to a list of tokens) the input texts `a` and `b`.
- Collect each token list into a corresponding counter (i.e. a data structure for counting how many times each token appears).
- Compute the intersection `a_i_b` of the tokens for `a` and `b` (the counter intersection semantics are illustrated in the snippet after this list).
- Compute some metric based on the total occurrences of `a_i_b` (`weight_a_i_b`) and the total occurrences of `b` (`weight_b`). This final metric is a proxy of the "amount" of `b` contained in `a`. It could be a ratio or a difference, and should use the fact that `weight_a_i_b <= weight_b` by construction.
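For reference, the `&` operator on two `collections.Counter` objects keeps each common token with the minimum of its two counts, which is exactly the multiset intersection needed here:

import collections

# each common token is kept with the minimum of its two counts
print(collections.Counter('aab') & collections.Counter('abb'))
# Counter({'a': 1, 'b': 1})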
The difference `weight_b - weight_a_i_b` results in a number between 0 and the number of tokens in `b`, which is also a direct measure of how many tokens from `b` are not found in `a`; hence 0 indicates perfect matching.
The ratio `weight_a_i_b / weight_b` results in a number between 0 and 1, with 1 meaning perfect matching and 0 meaning no matching.
The difference metric is probably more suited for small numbers of tokens, and it is easier to interpret and threshold in a meaningful way (e.g. accepting a value below 2 means that at most one token from `b` is not present in `a`).
On the other hand, the ratio is more standard and probably more suited for larger token lists.
All this would translate into the following code, leveraging `collections.Counter()` for counting the tokens:
import collections


def contains_tokens(
        text_a,
        text_b,
        tokenize_kws=None,
        metric=lambda a, b, a_i_b: b - a_i_b):
    """Compute a metric of how much of `text_b` is contained in `text_a`."""
    tokenize_kws = dict(tokenize_kws) if tokenize_kws is not None else {}
    # count the occurrences of each token in both texts
    counter_a = collections.Counter(tokenize(text_a, **tokenize_kws))
    counter_b = collections.Counter(tokenize(text_b, **tokenize_kws))
    # multiset intersection: common tokens with the minimum of the counts
    counter_a_i_b = counter_a & counter_b
    # the metric receives (weight_a, weight_b, weight_a_i_b)
    weight_a = counter_total(counter_a)
    weight_b = counter_total(counter_b)
    weight_a_i_b = counter_total(counter_a_i_b)
    return metric(weight_a, weight_b, weight_a_i_b)
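Since the metric is passed in as a parameter, the ratio variant discussed above can be obtained without touching the function body, e.g.:

# ratio variant: 1 means perfect matching, 0 means no matching
print(contains_tokens(
    'Test - 4567: Controlling_robotic_hand_with_Arduino_uno',
    'Controlling robotic hand',
    metric=lambda a, b, a_i_b: a_i_b / b))
# 1.0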
The first step, i.e. tokenization, is achieved with the following function. This is a bit primitive, but it gets the job done for your input. Essentially, it replaces a number of special characters (`ignores`) with blanks, and then splits the string along the blanks, optionally excluding the tokens in a blacklist (`excludes`).
def tokenize(
        text,
        case_sensitive=False,
        ignores=('_', '-', ':', ',', '.', '?', '!'),
        excludes=('the', 'from', 'to')):
    """Tokenize a text, ignoring some characters and excluding some tokens."""
    if not case_sensitive:
        text = text.lower()
    # turn the ignored characters into separators
    for ignore in ignores:
        text = text.replace(ignore, ' ')
    # split on whitespace and skip the blacklisted tokens
    for token in text.split():
        if token not in excludes:
            yield token
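For instance, on the first input this yields:

print(list(tokenize('Test - 4567: Controlling_robotic_hand_with_Arduino_uno')))
# ['test', '4567', 'controlling', 'robotic', 'hand', 'with', 'arduino', 'uno']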
To count the total number of values in a counter, the following function is used. Note that for Python 3.10 and later, there is a built-in method `Counter.total()` which does exactly the same.
def counter_total(counter):
    """Count the total number of values."""
    return sum(counter.values())
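On Python 3.10+ the equivalence can be checked directly:

c = collections.Counter('aab')
print(counter_total(c), c.total())
# 3 3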
For the given input this becomes:
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
and
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos2'
# one token from `b` (`pos2`) not in `a`
print(contains_tokens(a, b))
# 1
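Finally, applying the threshold is straightforward; for example, with the difference metric and a hypothetical `matches()` helper accepting at most one missing token (i.e. a value below 2, as discussed above):

# hypothetical helper: accept if at most `max_missing` tokens of `b` are missing from `a`
def matches(a, b, max_missing=1):
    return contains_tokens(a, b) <= max_missing

print(matches(
    'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1',
    'Controlling from pos0 to pos2'))
# True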
Note that distance-based functions (like `fuzz.token_set_ratio()` or `fuzz.partial_ratio()`) cannot be used in this context because they are sensitive to how much "noise" is present in the first text. For example, if `b = 'a b c'`, those tokens are contained equally in `a = 'a b c'` and in `a = 'a b c d e f g h i'`, and a distance cannot account for that, most notably because distance functions are symmetric (i.e. `f(a, b) == f(b, a)`) while the function you are looking for is not (i.e. `f(a, b) != f(b, a)`).
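For completeness, the asymmetry of the proposed approach can be checked directly:

a = 'a b c d e f g h i'
b = 'a b c'

# all tokens of `b` are in `a`...
print(contains_tokens(a, b))
# 0

# ...but 6 of the 9 tokens of `a` are not in `b`
print(contains_tokens(b, a))
# 6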