I want to group similar strings, but I would like the grouping to be smart enough to recognize when strings differ only in convention characters such as '/' or '-', rather than in their letters.

Given the following input:

moose
mouse
mo/os/e
m.ouse

alpha = ['/','.']

I want to group strings while ignoring a restricted set of characters, so the output should be:

moose
mo/os/e

mouse
m.ouse

I'm aware I can find similar strings using difflib, but it doesn't provide an option for limiting the alphabet. Is there another way of doing this? Thank you.

Update:

Instead of restricting the letters, the alpha characters are simpler to handle by just checking for their occurrences, so I've changed the title.

hurturk
  • Have you considered casefolding and stripping away anything that's not within `a-z`? – inspectorG4dget May 25 '17 at 19:48
  • Can you add some other examples? Can `/` appear anywhere in the word, e.g., `/moose` or `moose/`? Can the restricted symbols appear more than once? Can they co-occur? – isosceleswheel May 25 '17 at 19:51
  • Yes, anywhere and they can occur more than once. I will improve the example shortly. @inspectorG4dget stripping away special chars, grouping them with exact match (maybe allowing space) may help. Thinking about it. – hurturk May 25 '17 at 19:54
  • How about making a defaultdict(list) where keys are filtered words and values are lists of words? – hello world May 25 '17 at 19:55
  • @inspectorG4dget I think your solution is simple and makes sense with exact matching. Levenshtein may not be needed at all in that case. @.hello_world, you are thinking about a similar solution as I understand. – hurturk May 25 '17 at 20:04
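The casefold-and-strip idea from the comments could look roughly like the sketch below. The names `NON_LETTERS` and `normalize` are made up for illustration, and the sketch assumes plain ASCII words, since digits and accented letters would be dropped along with the punctuation.

```python
import re

# Hypothetical helper names; anything outside a-z is treated as noise.
NON_LETTERS = re.compile(r'[^a-z]')

def normalize(word):
    """Casefold the word, then drop every character outside a-z."""
    return NON_LETTERS.sub('', word.casefold())

# Two spellings belong together when their normalized forms match exactly.
same = normalize('Mo/os/e') == normalize('moose')
different = normalize('m.ouse') == normalize('moose')
```

Here `same` compares two spellings of one word and `different` compares two distinct words, so exact matching on the normalized form is enough and no edit distance is involved.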

3 Answers

2

Maybe something like:

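from collections import defaultdict

words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = ['/', '.']

# Key each word by its characters with the alpha characters stripped out
container = defaultdict(list)
for word in words:
    container[''.join(item for item in word if item not in alpha)].append(word)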
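To get from that dictionary to the layout asked for in the question, one possible sketch is shown here. The function name `group_by_stripped` and the variable `report` are assumptions for illustration, and the group order relies on dicts preserving insertion order (Python 3.7+).

```python
from collections import defaultdict

def group_by_stripped(words, alpha):
    """Group words that are identical once the alpha characters are removed."""
    ignored = set(alpha)  # set membership avoids scanning the list per character
    container = defaultdict(list)
    for word in words:
        key = ''.join(ch for ch in word if ch not in ignored)
        container[key].append(word)
    return list(container.values())

groups = group_by_stripped(['moose', 'mouse', 'mo/os/e', 'm.ouse'], ['/', '.'])
# One word per line, with a blank line between groups
report = '\n\n'.join('\n'.join(group) for group in groups)
print(report)
```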
hello world
  • Nice one, it works! I will wait a bit to see if more of a non-exact solution comes along. Thank you. – hurturk May 25 '17 at 20:17
  • Looks like there is no other way around it, as the example I provided even ignores a single-letter difference, so this should be the accepted answer. – hurturk May 25 '17 at 20:25
1

Here is an idea that takes a few (easy) steps:

import re
example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']

1. Index all of your strings so you can refer back to them by index later:

indexed_strings = list(enumerate(example_strings))

2. Store all strings with restricted characters in a dictionary using index as the key, string as the value. Then remove the restricted chars temporarily for sorting:

# regex to match restricted alphabet
restricted = re.compile(r'[/.]')
# dictionary to store strings with restricted char
restricted_dict = {}
for (idx, string) in indexed_strings:
    if restricted.search(string):
        # storing the string with a restricted char by its index
        restricted_dict[idx] = string
        # stripping the restricted char temporarily and returning to the list
        indexed_strings[idx] = (idx, restricted.sub('', string))

3. Sort the cleaned list of strings by string values, then iterate over the strings once more and replace the stripped strings with their original values:

indexed_strings.sort(key=lambda x: x[1])
# make a new list for the final set of strings
final_strings = []
for (idx, string) in indexed_strings:
    if idx in restricted_dict:
        final_strings.append(restricted_dict[idx])
    else:
        final_strings.append(string)

Result: ['ca...t', 'ca..//t', 'cat', 'm/oose', 'moose', 'mouse', 'm.ouse']

isosceleswheel
  • It's interesting, but the complexity with an extra dict is totally unneeded. You use `sort` with a `key`. Why not use it with `lambda string: restricted.sub('', string)` directly? – Eric Duminil May 25 '17 at 21:01
  • @EricDuminil I agree and more generally I concede that the accepted answer is much more elegant and accomplishes everything the OP wants in a single pass :-) – isosceleswheel May 25 '17 at 21:04
  • I'm not sure the accepted answer is a single pass: it needs to iterate over every `alpha` char for every `item` in `words`. For your solution, I was just saying that `1` and `2` are unneeded.  – Eric Duminil May 25 '17 at 21:07
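The simplification suggested in the comments, sorting directly on the stripped string, might be sketched as follows. It reuses the answer's `restricted` pattern and `example_strings`. Because Python's sort is stable, strings with the same stripped form keep their original relative order.

```python
import re

# Same restricted alphabet as in the answer: '/' and '.'
restricted = re.compile(r'[/.]')
example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']

# Sort by the stripped form; the original strings are never modified
final_strings = sorted(example_strings, key=lambda s: restricted.sub('', s))
print(final_strings)
# ['ca...t', 'ca..//t', 'cat', 'm/oose', 'moose', 'mouse', 'm.ouse']
```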
1

Since you want to group words, you should probably use groupby.

You just need to define a function which deletes alpha chars (e.g. with str.translate), and you can apply sort and groupby to your data:

from itertools import groupby

words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = ['/','.']

alpha_table = str.maketrans('', '', ''.join(alpha))

def remove_alphas(word):
    return word.lower().translate(alpha_table)

words.sort(key=remove_alphas)
print(words)
# ['moose', 'mo/os/e', 'mouse', 'm.ouse'] # <- Words are sorted correctly.

for common_word, same_words in groupby(words, remove_alphas):
    print(common_word)
    print(list(same_words))
# moose
# ['moose', 'mo/os/e']
# mouse
# ['mouse', 'm.ouse']
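One detail worth spelling out: `groupby` only merges adjacent items with equal keys, which is why the sort comes first. A small sketch of the difference, using hypothetical names `strip_alphas` and `group_words`:

```python
from itertools import groupby

# Translation table that deletes '/' and '.'
ALPHA_TABLE = str.maketrans('', '', '/.')

def strip_alphas(word):
    return word.lower().translate(ALPHA_TABLE)

def group_words(words):
    """Sort by the stripped key first so that equal keys end up adjacent."""
    ordered = sorted(words, key=strip_alphas)
    return [list(group) for _, group in groupby(ordered, key=strip_alphas)]

unsorted_words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
# Without sorting, the keys alternate, so every word lands in its own group
unsorted_groups = [list(g) for _, g in groupby(unsorted_words, key=strip_alphas)]
sorted_groups = group_words(unsorted_words)
```

With the unsorted input, `unsorted_groups` has four single-word groups, while `sorted_groups` has the two groups the question asks for.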
Eric Duminil