-1

How can I group values from an array using fuzzy matching with an 80% similarity threshold?

combined_list = ['magic', 'simple power', 'matrix', 'simple aa', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'dour', 'softy'] 

should yield:

['magic', 'magics'], ['simple pws', 'simple aa'], ['simple power'], ['matrix']

Here is what I have so far, but the output is very different from my goal. It also only handles a few values, and I plan to run it on about 50,000 records.

from difflib import SequenceMatcher as sm

combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
result = list()
result_group = list()

for x in combined_list:

    for name in combined_list:
        if(sm(None, x, name).ratio() >= 0.80):
            result_group.append(name)
        else:
            pass

    result.append(result_group)
    print(result)
    del result_group[:]


print(result)

The print(result) outside the loop shows only empty lists, while the one inside the loop contains the values I need, although that output is still different from what I want:

[['magic', 'magics']]
[['simple power', 'simple pws'], ['simple power', 'simple pws']]
[['matrix'], ['matrix'], ['matrix']]
[['madness'], ['madness'], ['madness'], ['madness']]
[['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics']]
[['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa']]
[['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws']]
[['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek']]
[['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour']]
[['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft']]
[['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa']]

[[], [], [], [], [], [], [], [], [], [], []]
Led

3 Answers

2

The problem is in these lines:

result.append(result_group)
print(result)
del result_group[:]

You append a list to your result, but since lists are mutable, result stores only a reference to result_group, not a copy. When you then modify result_group (here, by deleting all of its elements), every entry in result that refers to it changes too. Instead, append a copy:

result.append(result_group[:])
print(result)
del result_group[:]
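
To see the aliasing effect in isolation, here is a minimal sketch. The names `group`, `aliased` and `copied` are made up for this illustration and do not appear in the question's code.

```python
# Appending a list stores a reference to it, not a snapshot of its contents.
group = ['magic', 'magics']

aliased = []
aliased.append(group)      # same list object as `group`

copied = []
copied.append(group[:])    # shallow copy, independent of `group`

del group[:]               # empty the original list in place

print(aliased)  # [[]]
print(copied)   # [['magic', 'magics']]
```

The slice `group[:]` makes a shallow copy, which is enough here because the elements are strings.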

Or don't delete the list elements but create a new list for every iteration:

for x in combined_list:
    result_group = []
    for name in combined_list:
        ...

    result.append(result_group)

Edit: If you want to get rid of duplicates, try using a set instead of a list:

# result = list()
result = set()

...
# result.append(result_group)
result.add(tuple(result_group))

Sets always contain unique members. However, since lists are not hashable, you need to convert each group to a tuple first.
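
A small sketch of that behaviour, using an illustrative variable named `groups` (not part of the code above). It also shows one caveat: two tuples with the same members in a different order count as different entries.

```python
groups = set()

try:
    groups.add(['magic', 'magics'])   # lists are mutable and therefore unhashable
except TypeError as exc:
    print("cannot add a list:", exc)

groups.add(('magic', 'magics'))
groups.add(('magic', 'magics'))       # exact duplicate, silently ignored
groups.add(('magics', 'magic'))       # different order, so a different tuple

print(len(groups))  # 2
```

In the loop above this caveat should not bite, because the inner loop always walks combined_list in the same order.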

Edit2: Putting it all together and checking for actual groups of 2+ members:

from difflib import SequenceMatcher as sm

combined_list = ['magic', 'simple power', 'matrix', 'madness',
                 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']

# using a set ensures there are no duplicates
result = set()

for x in combined_list:
    result_group = []
    for name in combined_list:
        if sm(None, x, name).ratio() >= 0.80:
            result_group.append(name)

    if len(result_group) > 1:  # this gets rid of single-word groups
        result.add(tuple(result_group))

print(result)
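
If the groups still look different from what you expect, it may help to print the ratio for a few pairs. The helper `similarity` below is a hypothetical wrapper written for this illustration; `ratio()` itself returns 2 * M / T, where M is the number of matched characters and T is the combined length of both strings.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return SequenceMatcher's ratio for two strings (a float in [0, 1])."""
    return SequenceMatcher(None, a, b).ratio()


pairs = [
    ('magic', 'magics'),
    ('simple power', 'simple pws'),
    ('simple pws', 'simple aa'),
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.3f}")
```

The first two pairs score about 0.91 and 0.82, so they pass the 0.80 threshold. 'simple pws' vs 'simple aa' scores about 0.74, so with this metric those two would not end up in the same group.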
Dux
  • Thanks Dux, but the output is still different from what I'd like to achieve. – Led Jun 06 '18 at 15:35
  • Thanks Dux, it helped me a lot. Still, the output is a little off. Sorry if my question has errors. – Led Jun 06 '18 at 16:16
  • If you could PLEASE just tell us what the output is supposed to be... Are you looking to exclude words that do not have any matches? Or do you want to exclude duplicates? – Dux Jun 06 '18 at 16:19
  • Hi, sorry. The desired output is to group words (add them into an array) that are around 80% similar. The final output would be all words, grouped. Example output: [['magic', 'magics'], ['simple power', 'simple pws'], ['matrix'], ['madness'], ['mgcsa'], ['simple power', 'simple pws'], ['seek'], ['sour'], ['soft'], ['simple aa']] – Led Jun 06 '18 at 17:16
  • @Led, maybe the only issue is that your `combined_list` in your code sample is different than what you expect? – Dux Jun 06 '18 at 17:18
  • Yeah, but it's duplicating. I don't seem to get the words grouped in an array. Also, when I first ran this code it produced multiple results, which is why I tried to delete the group array after appending it to the results. Unfortunately, doing that makes the array empty. – Led Jun 06 '18 at 17:20
  • Yeah, I am really not confident about that part of the script. I tried messing around with it, using iterators and so on, but it failed. – Led Jun 06 '18 at 17:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172615/discussion-between-dux-and-led). – Dux Jun 06 '18 at 17:22
2
from difflib import SequenceMatcher as sm

combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics', 
'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
result = list()
result_group = list()
usedElements = list()

for firstName in combined_list:
    # skip names that were already placed into a group
    if firstName in usedElements:
        continue

    # collect every name that is at least 80% similar to firstName
    for secondName in combined_list:
        if sm(None, firstName, secondName).ratio() >= 0.80:
            result_group.append(secondName)
            usedElements.append(secondName)

    result.append(result_group[:])
    del result_group[:]

print(result)

I added a way to remove duplicates: every element that has been placed into a group is also added to the usedElements list, and elements already in that list are skipped in the outer loop.

It keeps groups of one. If you don't want ungrouped elements in the result, change the last segment of the loop to this:

    if len(result_group) > 1:
        result.append(result_group[:])
    del result_group[:]

print(result)
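
For the roughly 50,000 records mentioned in the question, the same idea can be sketched with a set for the membership test and with difflib's cheap upper-bound checks in front of the full comparison. This is only a sketch under assumptions: `group_similar`, `names` and `threshold` are names invented here, and unlike the loop above it only looks forward in the list, so each name lands in exactly one group.

```python
from difflib import SequenceMatcher


def group_similar(names, threshold=0.80):
    """Greedily group names; each name ends up in exactly one group."""
    groups = []
    used = set()  # indices that already belong to a group
    matcher = SequenceMatcher(None)

    for i, first in enumerate(names):
        if i in used:
            continue
        used.add(i)
        group = [first]
        # SequenceMatcher caches data about seq2, so set it once per group
        matcher.set_seq2(first)

        for j in range(i + 1, len(names)):
            if j in used:
                continue
            matcher.set_seq1(names[j])
            # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
            # so a pair that fails them cannot pass the full check
            if (matcher.real_quick_ratio() >= threshold
                    and matcher.quick_ratio() >= threshold
                    and matcher.ratio() >= threshold):
                group.append(names[j])
                used.add(j)

        groups.append(group)
    return groups


names = ['magic', 'simple power', 'matrix', 'madness', 'magics',
         'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
print(group_similar(names))
```

The number of pairs still grows quadratically in the worst case, so this only reduces the cost per pair. Whether it is fast enough for 50,000 records depends on the data.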

Hope this helps.

Austin B
0
from difflib import SequenceMatcher as sm

combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
combined_list.sort()


def getPairs(combined_list):
    results = list()
    grouped = set()  # names that already belong to a group
    for x in combined_list:
        if x in grouped:
            continue
        result_group = list()
        for name in combined_list:
            if sm(None, x, name).ratio() >= 0.80:
                result_group.append(name)
                grouped.add(name)

        results.append(result_group)
    return results

print(getPairs(combined_list))
marya