I have a set of strings (in my case it is a column of a pandas dataframe, but it would be fine to consider alternative data structures such as lists/arrays/...) and I would like to get all "unique" values from that set, where unique means not exact matching but fuzzy matching based on some similarity measure. To give an example, let's imagine I have this starting set of strings:
error string |
---|
Source and destination checksums do not match 213423 != 647687 transfer-failed |
Source and destination checksums do not match 654766 != 987821 transfer-failed |
SSL handshake after 1 attempts |
SSL handshake after 1 attempts\t |
SSL handshake after 1 attempts.\n |
Impossible to connect to IP:PORT/PATH{1} User timeout over* |
Impossible to connect to IP:PORT/PATH{2} User timeout over* |
*where IP, PORT and PATH are placeholders for possibly long strings with completely different characters from option {1} to option {2}.
What I would like as an output is a list of the 3 unique patterns (I marked the third as optional since I guess it would be trickier):
unique patterns | requirement |
---|---|
Source and destination checksums do not match 213423 != 647687 transfer-failed | mandatory |
SSL handshake after 1 attempts | mandatory |
Impossible to connect to IP:PORT/PATH{1} User timeout over* | optional |
I'm aware of some methods for fuzzy matching, for example those in the Levenshtein and fuzzywuzzy packages, and I think fuzzywuzzy.fuzz.partial_token_set_ratio and fuzzywuzzy.fuzz.partial_ratio do what I want, but only for comparing 2 strings, or one string against a list of others (fuzzywuzzy.process.extract), as opposed to deduplicating all the strings together.
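To make the pairwise scores concrete, here is what a stdlib similarity measure (difflib.SequenceMatcher, which I believe fuzzywuzzy falls back to when python-Levenshtein is not installed) gives on some of the example strings:

```python
from difflib import SequenceMatcher

# Two near-duplicate error strings differing only by a trailing tab,
# plus an unrelated one for contrast.
a = "SSL handshake after 1 attempts"
b = "SSL handshake after 1 attempts\t"
c = "Source and destination checksums do not match 213423 != 647687 transfer-failed"

# ratio() returns a similarity in [0, 1]; near-duplicates score close to 1.
print(SequenceMatcher(None, a, b).ratio())  # close to 1.0
print(SequenceMatcher(None, a, c).ratio())  # much lower
```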
I started implementing this myself, but I soon realised it is a bit tricky and needs careful consideration of how it scales, so I was wondering whether there is already something available for this purpose. Do you have any suggestions?
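For reference, my rough attempt looks like the sketch below: a greedy single pass that keeps a string only if it isn't similar to any representative already kept. I use stdlib difflib here just to keep the sketch dependency-free (fuzz.partial_ratio would slot in the same way), and the threshold value is a guess that would need tuning. This is O(n·k) comparisons for k representatives, which is exactly the scaling worry I mentioned:

```python
from difflib import SequenceMatcher

def fuzzy_unique(strings, threshold=0.8):
    """Greedy one-pass dedup: keep a string only when it is not
    similar (ratio >= threshold) to any representative kept so far.
    The threshold is an assumption and needs tuning per dataset."""
    representatives = []
    for s in strings:
        if not any(SequenceMatcher(None, s, rep).ratio() >= threshold
                   for rep in representatives):
            representatives.append(s)
    return representatives

errors = [
    "Source and destination checksums do not match 213423 != 647687 transfer-failed",
    "Source and destination checksums do not match 654766 != 987821 transfer-failed",
    "SSL handshake after 1 attempts",
    "SSL handshake after 1 attempts\t",
    "SSL handshake after 1 attempts.\n",
]
print(fuzzy_unique(errors))  # the two mandatory patterns survive
```

This already fails on the "optional" case, though: with genuinely long, different IP/PORT/PATH substrings the ratio between the two "Impossible to connect..." lines would drop below any fixed threshold.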
Thanks in advance :)