
I have a set of strings (in my case a column of a pandas dataframe, but alternative data structures such as lists or arrays would be fine too) and I would like to get all "unique" values from that set, where uniqueness is decided not by exact matching but by fuzzy matching based on some similarity measure. To give an example, let's imagine I have this starting set of strings:

error string
Source and destination checksums do not match 213423 != 647687 transfer-failed
Source and destination checksums do not match 654766 != 987821 transfer-failed
SSL handshake after 1 attempts
SSL handshake after 1 attempts\t
SSL handshake after 1 attempts.\n
Impossible to connect to IP:PORT/PATH{1} User timeout over*
Impossible to connect to IP:PORT/PATH{2} User timeout over*

*where IP, PORT and PATH are placeholders for possibly long strings with completely different characters from option {1} to option {2}.

What I would like as an output is a list of the 3 unique patterns (I marked the third as optional since I guess it would be more tricky):

unique patterns | requirement
Source and destination checksums do not match 213423 != 647687 transfer-failed | mandatory
SSL handshake after 1 attempts | mandatory
Impossible to connect to IP:PORT/PATH{1} User timeout over* | optional

I'm aware of some methods for fuzzy matching, for example as in Levenshtein and fuzzywuzzy packages, and I think fuzzywuzzy.fuzz.partial_token_set_ratio and partial_ratio do what I want, but only for comparing 2 strings or one string to all the others (fuzzywuzzy.process.extract), as opposed to all the strings together.

I started implementing this myself, but I soon realised it is a bit tricky and that how it scales needs careful consideration, so I was wondering whether something is already available for this purpose. Do you have any suggestions?

Thanks in advance :)

Luca Clissa
  • Please [edit] your question to clarify what exactly you need. "unique fuzziness" is not a clear requirement, and could be interpreted in countless ways. Does your data contain fuzzy "groups", and you only want to store one representative of each group? Does your data contain continuously fuzzy equal strings, and you need to partition it to preserve some initial features? Can you provide some sample input and expected output? – MisterMiyagi Feb 26 '21 at 10:50
  • I edited the question, hope it is clearer now :) – Luca Clissa Feb 26 '21 at 12:05

1 Answer

I'm sure there's a better way, but using Levenshtein, which can only compare 2 strings at a time, I came up with this:

import Levenshtein as lev

# Minimum similarity (0 to 1) for two strings to count as the same pattern.
RATIO_LIMIT = 0.7

strings = (
    "Source and destination checksums do not match 213423 != 647687 transfer-failed",
    "Source and destination checksums do not match 654766 != 987821 transfer-failed",
    "SSL handshake after 1 attempts",
    "SSL handshake after 1 attempts\t",
    "SSL handshake after 1 attempts.\n",
    "Impossible to connect to IP:PORT/PATH{1} User timeout over*",
    "Impossible to connect to IP:PORT/PATH{2} User timeout over*",
)

uniques = []

for string in strings:
    # Compare against every pattern kept so far, ignoring case and
    # surrounding whitespace.
    for unique in uniques:
        if lev.ratio(unique.lower().strip(), string.lower().strip()) > RATIO_LIMIT:
            break  # close enough to a known pattern, so skip this string
    else:
        # The loop finished without a break: nothing similar was found.
        uniques.append(string)

print(uniques)

I'm sure you can tune RATIO_LIMIT for better results; I just picked an arbitrary similarity threshold. Does this work with your real values for PATH, IP and PORT, though? I'd guess that if they are too long, the differing part will dominate the ratio and this method won't work.
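
One way to check that guess is to measure the ratio directly. The snippet below is only a sketch: it uses difflib.SequenceMatcher from the standard library as a stand-in for lev.ratio (the two scores behave similarly but are not identical), and the payloads, the TEMPLATE string and the mask_payload helper with its regular expression are made up for illustration rather than taken from the real data.

```python
import re
from difflib import SequenceMatcher

# Assumed message shape, modelled on the third example in the question.
TEMPLATE = "Impossible to connect to {} User timeout over"


def ratio(a, b):
    # Stand-in for lev.ratio; autojunk=False disables difflib's
    # "popular element" heuristic for long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def mask_payload(message):
    # Hypothetical rule: whatever sits between the fixed prefix and the
    # fixed suffix is the variable part, so swap it for a constant token.
    return re.sub(r"(?<=connect to ).*(?= User timeout)", "<TARGET>", message)


# Two made-up payloads of 100 characters that share nothing.
long_a = TEMPLATE.format("a" * 100)
long_b = TEMPLATE.format("b" * 100)

raw_score = ratio(long_a, long_b)
masked_score = ratio(mask_payload(long_a), mask_payload(long_b))

print(f"raw: {raw_score:.2f}, masked: {masked_score:.2f}")
# prints "raw: 0.30, masked: 1.00"
```

With payloads this long the raw score lands well below a threshold of 0.7, so the two messages would be kept as separate patterns. Masking the variable part first brings them back together, provided a rule for spotting that part can be written for the real messages.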

Rolv Apneseth
  • Thanks @rolv, I was working exactly on something like this and I confirm your code does the trick. However, my original application involves hundreds of thousands of strings and looping through them would be inefficient, so I was wondering if there's already something ready-made out there that I could use out of the box without taking care of performance details and corner cases, e.g. just specifying the RATIO threshold. – Luca Clissa Mar 10 '21 at 15:46
  • Ah, I see. I don't know of any, but best of luck to you. Sorry I couldn't be of more help. – Rolv Apneseth Mar 10 '21 at 18:49
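
On the scaling concern raised in the comments, one possible direction (a sketch under stated assumptions, not a ready-made library solution) is to cut down the work before any expensive comparison runs. First collapse exact duplicates after a cheap normalisation, then use the real_quick_ratio() and quick_ratio() methods of difflib.SequenceMatcher, which are upper bounds on ratio(), to skip pairs that cannot reach the threshold. difflib is used here in place of the Levenshtein package, and the names normalise and fuzzy_unique as well as the digit-masking rule are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    # Cheap canonical form: trim, lowercase, mask digit runs (assumed rule).
    return re.sub(r"\d+", "0", text.strip().lower())


def fuzzy_unique(strings, threshold=0.7):
    # Step 1: exact de-duplication on the normalised form, keeping the
    # first original string seen for each form (dicts preserve order).
    first_seen = {}
    for text in strings:
        first_seen.setdefault(normalise(text), text)

    # Step 2: greedy fuzzy pass over the (hopefully much smaller) set of forms.
    kept = []  # pairs of (normalised form, original string)
    for form, original in first_seen.items():
        matcher = SequenceMatcher(None, autojunk=False)
        matcher.set_seq2(form)  # seq2 data is cached, so set it once
        for kept_form, _ in kept:
            matcher.set_seq1(kept_form)
            # The quick ratios never underestimate ratio(), so failing
            # either one means the full comparison can be skipped.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold
                    and matcher.ratio() > threshold):
                break
        else:
            kept.append((form, original))
    return [original for _, original in kept]


strings = [
    "Source and destination checksums do not match 213423 != 647687 transfer-failed",
    "Source and destination checksums do not match 654766 != 987821 transfer-failed",
    "SSL handshake after 1 attempts",
    "SSL handshake after 1 attempts\t",
    "SSL handshake after 1 attempts.\n",
    "Impossible to connect to IP:PORT/PATH{1} User timeout over*",
    "Impossible to connect to IP:PORT/PATH{2} User timeout over*",
]

patterns = fuzzy_unique(strings)
print(patterns)
```

On the sample strings from the question this keeps the first message of each of the three groups. The first step is linear in the number of strings, but the second is still quadratic in the number of distinct normalised forms in the worst case, so how much it helps depends on how well the normalisation collapses the real data.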