How to merge similar items in a list

Question

I haven't found anything relevant on Google, so I'm hoping to find some help here :)

I've got a Python list as follows:

[['hoose', 200], ["Bananphone", 10], ['House', 200], ["Bonerphone", 10], ['UniqueValue', 777] ...]

I have a function that returns the Levenshtein distance between two strings; for House and hoose it returns 2, and so on.

Now I want to merge list elements whose Levenshtein distance is below some threshold, e.g. < 5, while (!) adding their scores. For the resulting list I want the following:

[['hoose', 400], ["Bananphone", 20], ['UniqueValue', 777], ...]

or

[['House', 400], ["Bonerphone", 20], ['UniqueValue', 777], ...]

etc.

It doesn't matter as long as their values get added.

There will only ever be 2 items in the list that are very similar, so a chain effect of any one item similar to a lot of others eating them all up isn't expected.

What do you do if you have three items `A`, `B` and `C`, where `A` is similar to `B`, `B` is similar to `C`, but `A` is not similar to `C`? — Björn Pollex, Mar 20 '11 at 17:58
I think @Space_C0wb0y's point is that what you want isn't very well defined - in that example, would you expect the counts for A, B and C to be merged? If so, and you have a dictionary-worth of words, that may end up merging the vast majority of them.... — Mark Longair, Mar 20 '11 at 18:04
@Kami: In your example, `hoose` and `House` are similar to each other, but none of them is similar to `Bananaphone`, so there is no conflict. In general, is it possible to merge more than two items? How would that be done? Do all items that are merged into one have to be similar to all other items in that merge? — Björn Pollex, Mar 20 '11 at 18:05
Also, I don't understand what you want to be done with the second item in the innermost list (which is `5` in all cases above). What if it differs between two merge-able entries? — senderle, Mar 20 '11 at 18:06
Yeah, that stumped me for a second. :) I would want all of them to be merged, since the words I have are pretty unique and I can make the merge value pretty strict to prevent such scenarios. — Kami, Mar 20 '11 at 18:07
@senderle It is just a relic from my original list. I'm going to edit the original question to make it clearer. — Kami, Mar 20 '11 at 18:08

score 8 · Answer 1 · answered Mar 20 '11 at 18:27

To bring home the point from my comment, I just grabbed an implementation of that distance from here, and calculated some distances:

d('House', 'hoose') = 2
d('House', 'trousers') = 4
d('trousers', 'hoose') = 5

Now, suppose your threshold is 4. You would have to merge House and hoose, as well as House and trousers, but not trousers and hoose. Are you sure something like this can never happen with your data?
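
The three distances above can be reproduced with a plain dynamic-programming edit distance. This is only a minimal sketch: the `levenshtein` function below is a stand-in written for illustration, not the implementation linked in this answer, and `THRESHOLD` simply encodes the "threshold is 4" assumption from the paragraph above.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


THRESHOLD = 4  # assumption: "similar" means distance <= 4
words = ['House', 'hoose', 'trousers']

# Print every unordered pair with its distance and the resulting verdict.
for i, a in enumerate(words):
    for b in words[i + 1:]:
        d = levenshtein(a, b)
        print(a, b, d, 'similar' if d <= THRESHOLD else 'not similar')
```

The comparison is case-sensitive, which is why `House` and `hoose` differ in two positions. The printed verdicts show that "similar" is not transitive here: `House` is similar to both other words, but they are not similar to each other.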

In the end, I think this is more of a clustering problem, so you probably have to look into clustering algorithms. SciPy offers an implementation of hierarchical clustering that works with custom distance functions (be aware that this can be very slow for larger data sets, and it also consumes a lot of memory).
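
To make the clustering idea concrete without pulling in SciPy, here is a dependency-free sketch of one specific variant. It relies on the fact that single-linkage clusters cut at a fixed threshold are the connected components of the graph that links every pair of words closer than that threshold. The choice of single linkage, the helper names `single_linkage_groups` and `levenshtein`, and the sample scores are all assumptions made for illustration; other linkage methods would group the words differently.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def single_linkage_groups(items, distance, threshold):
    """Group [word, score] pairs transitively; return [[words, total], ...]."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of words that is closer than the threshold.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i][0], items[j][0]) < threshold:
                parent[find(i)] = find(j)

    # Collect the words and the summed score of each connected component.
    groups = {}
    for i, (word, score) in enumerate(items):
        group = groups.setdefault(find(i), [[], 0])
        group[0].append(word)
        group[1] += score
    return list(groups.values())


data = [['hoose', 200], ['House', 200], ['trousers', 100], ['Bananaphone', 10]]
groups = single_linkage_groups(data, levenshtein, 5)
print(groups)
```

With these inputs, `trousers` ends up in the same group as `hoose` even though the two are not within the threshold of each other, because `House` connects them. That is exactly the chain effect discussed above.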

The main problem is deciding on a measure of cluster quality, because there is no single correct solution to your problem. This paper (pdf) gives you a starting point for understanding that problem.

Thanks for your help (the links are very interesting), but I had to award the solution to Mark since he actually solved my specific problem (with only 2 items ever being very similar). You've got a +1 from me though! Greetings from Vienna :) — Kami, Mar 22 '11 at 01:01
hey but with hierarchical clustering you get a group of words that are similar when what is needed is one representing the whole group. — CpILL, Feb 17 '22 at 05:59

score 4 · Accepted Answer · edited May 23 '17 at 11:45

In common with the other comments, I'm not sure that doing this makes much sense, but here's a solution that does what you want, I think. It's very inefficient - O(n²) where n is the number of words in your list - but I'm not sure there's a better way of doing it:

data = [['hoose', 200],
        ["Bananphone", 10],
        ['House', 200],
        ["Bonerphone", 10],
        ['UniqueValue', 777]]

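# Each entry is [set_of_similar_words, summed_score].
already_merged = []

for word, score in data:
    added_to_existing = False
    for merged in already_merged:
        # Compare against every word already in this group.
        for potentially_similar in merged[0]:
            # levenshtein() is the asker's own distance function.
            if levenshtein(word, potentially_similar) < 5:
                merged[0].add(word)
                merged[1] += score
                added_to_existing = True
                break
        if added_to_existing:
            break
    # No similar group found: start a new one.
    if not added_to_existing:
        already_merged.append([set([word]),score])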

print already_merged

The output is:

[[set(['House', 'hoose']), 400], [set(['Bonerphone', 'Bananphone']), 20], [set(['UniqueValue']), 777]]

One of the obvious problems with this approach is that the word that you're considering might be close enough to many of the different sets of words that you've already considered, but this code will just lump it into the first one it finds. I've voted +1 for Space_C0wb0y's answer ;)
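
The output above keeps a set of words per group, whereas the question asks for a single word per merged entry. A small post-processing sketch can collapse each set to one representative. The rule used in the hypothetical `pick_representative` helper (shortest word, ties broken alphabetically) is an arbitrary assumption, chosen only because it is deterministic; the question states that any member of the group is acceptable.

```python
# Result of the merge loop, written with Python 3 set literals.
already_merged = [
    [{'House', 'hoose'}, 400],
    [{'Bonerphone', 'Bananphone'}, 20],
    [{'UniqueValue'}, 777],
]


def pick_representative(words):
    # Assumed rule: shortest word first, then alphabetical order.
    # Uppercase letters sort before lowercase ones in Python.
    return min(words, key=lambda w: (len(w), w))


# Build the [[word, total], ...] shape requested in the question.
flattened = [[pick_representative(words), total]
             for words, total in already_merged]
print(flattened)
```

Swapping in a different key function, for example one that prefers the word with the highest original score, only requires changing `pick_representative`.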

I guess the way to get around the non-deterministic behavior is to sort the list first, ascending from the least desired in the final result to the most. I wanted shorter words, so I sorted by `len(word)` and some other criteria. If you can count which words get the most other words merged into them, you could sort by that as well, although this doesn't guarantee a total order. — CpILL, Feb 17 '22 at 19:59

Hugh Bothwell · Answer 3 · 2011-03-20T20:26:49.157

import Levenshtein
import operator
import cluster

class Item(object):
    @classmethod
    def fromList(cls,lst):
        return cls(lst[0][0], lst[0][1], lst[1])

    def __init__(self, name, val=0, score=0):
        super(Item,self).__init__()
        self.name     = name
        self.val      = val
        self.score    = score

    def dist(self, other):
        return 100 if other is self else Levenshtein.distance(self.name, other.name)

    def __str__(self):
        return "('{0}', {1})".format(self.name, self.val)

def main():
    myList = [
        [['hoose', 5], 200],
        [['House', 5], 200],
        [["Bananaphone", 5], 10],
        [['trousers', 5], 100]
    ]
    items = [Item.fromList(i) for i in myList]

    cl = cluster.HierarchicalClustering(items, (lambda x,y: x.dist(y)))
    for group in cl.getlevel(5):
        groupScore = sum(item.score for item in group)
        groupStr   = ', '.join(str(item) for item in group)
        print "{0}: {1}".format(groupScore, groupStr)

if __name__=="__main__":
    main()

returns

10: ('Bananaphone', 5)
500: ('trousers', 5), ('hoose', 5), ('House', 5)

score 0 · Answer 4 · answered Nov 20 '18 at 07:11

@Mark Longair I was getting some errors in Python 3.5, so I corrected the code as below:

import Levenshtein
data = [['hoose', 200],
       ["Bananphone", 10],
       ['House', 200],
       ["Bonerphone", 10],
       ['UniqueValue', 777]]

already_merged = []

for word, score in data:
    added_to_existing = False
    for merged in already_merged:
        for potentially_similar in merged[0]:
            if Levenshtein.distance(word, potentially_similar) < 5:
                merged[0].add(word)
                merged[1] += score
                added_to_existing = True
                break
        if added_to_existing:
            break
    if not added_to_existing:
        already_merged.append([set([word]),score])

print(already_merged)

@Mark thanks for such an easy solution.

score 0 · Answer 5 · answered Mar 20 '11 at 18:12

Blueprint:

result = dict()
data = [[['hoose', 5], 200], [['House', 5], 200], [["Bananaphone", 5], 10]]  # ... more items
for item in data:

   key = tuple(item[0]) # ('hoose', 5); a list is unhashable, so convert it to a tuple
   value = item[1] # 200

   if key not in result:
       result[key] = 0
   result[key] += value

It might be necessary to adjust the code for unpacking the inner list items.
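
A dictionary groups by exact key only, which is worth seeing on a small input. The sketch below uses `collections.defaultdict` with plain `[word, score]` pairs; the sample data, including the repeated `hoose` entry, is hypothetical and only there to show one case that merges and one that does not.

```python
from collections import defaultdict

# Hypothetical sample: 'hoose' appears twice, 'House' differs by spelling.
data = [['hoose', 200], ['House', 200], ['hoose', 50], ['Bananaphone', 10]]

totals = defaultdict(int)  # missing keys start at 0
for word, score in data:
    totals[word] += score

print(dict(totals))
```

The two `hoose` entries are summed, but `House` stays separate because it is a different key. Exact-key grouping therefore handles true duplicates, not near-duplicates.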

:( hoose and House get assigned different hash values – Kami Mar 20 '11 at 18:15

score 0 · Answer 6 · answered Mar 20 '11 at 18:27

You didn't say the number of items in your list, but I'm guessing n^2 complexity is OK.

You also didn't say if you wanted all possible pairs to be compared or just the neighboring ones. I assume all pairs.

So here's the idea:

1. Take the first item and calculate the Levenshtein distance against all other items.
2. Merge all items whose distance is less than 5 by removing them from the list and summing their scores.
3. In the merged list, take the next item and compare it to all items except the one you just checked.
4. Repeat until there are no unchecked items left in the list.
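
The steps above can be sketched roughly as follows. This is one possible reading of the idea, not a definitive implementation: `merge_greedy` and the pure-Python `levenshtein` are names invented for this example, the threshold of 5 comes from the question, and each group is represented by the first word encountered.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def merge_greedy(items, distance, threshold=5):
    """Merge [word, score] pairs into the first similar word seen."""
    remaining = [list(item) for item in items]  # copy, leave the input intact
    merged = []
    while remaining:
        # Take the next unchecked item as the head of a new group.
        word, total = remaining.pop(0)
        keep = []
        for other_word, other_score in remaining:
            if distance(word, other_word) < threshold:
                total += other_score  # absorb the similar item
            else:
                keep.append([other_word, other_score])
        remaining = keep
        merged.append([word, total])
    return merged


data = [['hoose', 200], ["Bananphone", 10], ['House', 200],
        ["Bonerphone", 10], ['UniqueValue', 777]]
merged = merge_greedy(data, levenshtein)
print(merged)
```

Because every remaining item is compared only against the head of the current group, the result depends on the input order, and chains such as A close to B, B close to C are not followed.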
