How to cluster list-of-list by distance condition in Python

Question

I have the following list of lists that contains 6 entries:

lol = [['a', 3, 1.01],
       ['x', 5, 1.00],
       ['k', 7, 2.02],
       ['p', 8, 3.00],
       ['b', 10, 1.09], 
       ['f', 12, 2.03]]

Each sublist in lol contains 3 elements:

['a', 3, 1.01]
  e1  e2  e3

The list above is already sorted according to e2 (i.e., the 2nd element).

I'd like to 'cluster' the above list roughly following these steps:

1. Pick the lowest entry (with respect to e2) in lol as the key of the first cluster.
2. Assign it as the first member of that cluster (a dictionary of lists).
3. For the next list, calculate the difference between its e3 and the e3 of the first member of each existing cluster.
4. If the difference is less than the threshold, assign that list as a member of the corresponding cluster. Otherwise, create a new cluster with the current list as the new key.
5. Repeat for the remaining lists until finished.

The final result will look like this, with threshold <= 0.1.

dol = {'a':['a', 'x', 'b'],
       'k':['k', 'f'],
       'p':['p']}

I'm stuck with the attempt below. What's the right way to do it?

import json
from collections import defaultdict

thres = 0.1
tmp_e3 = 0
tmp_e1 = "-"

lol = [['a', 3, 1.01], ['x', 5, 1.00], ['k', 7, 2.02],
       ['p', 8, 3.00], ['b', 10, 1.09], ['f', 12, 2.03]]

dol = defaultdict(list)
for thelist in lol:
    e1, e2, e3 = thelist

    if tmp_e1 == "-":
        tmp_e1 = e1
    else:
        diff = abs(tmp_e3 - e3)
        if diff > thres:
            tmp_e1 = e1

    dol[tmp_e1].append(e1)
    tmp_e1 = e1
    tmp_e3 = e3

print json.dumps(dol, indent=4)

@Yax: `{'a': ['a', 'x'], 'b': ['b'], 'f': ['f'], 'k': ['k'], 'p': ['p']}` — pdubois, Nov 18 '14 at 04:53
Sorry, I went out to pray. You need to think this through by tracing the execution of your code. Now, include `print diff > thres, thelist` in your `else` statement to see the hint it gives. — Yax, Nov 18 '14 at 05:32
Remember the first list will also meet your condition which will now make it 5. — Yax, Nov 18 '14 at 05:35
OT: I recommend `pprint.pprint(data)` instead of `print json.dumps(data)` — Dima Tisnek, Nov 18 '14 at 08:01
I think I must be misunderstanding your problem statement: why is the data sorted according to `e2`? This column doesn't seem to be used anywhere... In a naive solution, I'd get `ax|k|p|b|f`; then if I were to re-sort the list, the clusters are `xab|kf|p`, and only the order of elements in a cluster is different. The question is, why not cluster by `e3` first and then pick the cluster leader according to `e2`? (caveat: different results in corner cases) — Dima Tisnek, Nov 18 '14 at 08:30

score 2 · Accepted Answer · answered Nov 18 '14 at 08:16

I would first ensure lol is sorted on the second element, then iterate, each time keeping in the list only the entries that are not within the threshold of the first element:

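import json

thres = 0.1

lol = [['a', 3, 1.01], ['x', 5, 1.00], ['k', 7, 2.02],
       ['p', 8, 3.00], ['b', 10, 1.09], ['f', 12, 2.03]]

# ensure lol is sorted on e2 (the second element)
lol.sort(key=lambda x: x[1])
dol = {}

while lol:
    # the first remaining entry becomes the key of a new cluster
    x = lol.pop(0)
    lol2 = []
    dol[x[0]] = [x[0]]
    for i in lol:
        if abs(i[2] - x[2]) < thres:
            # within the threshold of the key: join its cluster
            dol[x[0]].append(i[0])
        else:
            # otherwise keep the entry for a later pass
            lol2.append(i)
    lol = lol2

# print(...) with a single argument works on both Python 2 and Python 3
print(json.dumps(dol, indent=4))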

Result:

{
    "a": [
        "a", 
        "x", 
        "b"
    ], 
    "p": [
        "p"
    ], 
    "k": [
        "k", 
        "f"
    ]
}
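
The same idea can also be written as a single pass that compares each entry with the first member of every existing cluster, without repeatedly rebuilding the list. This is only a sketch and not part of the original answer: the function name `cluster_by_first_member` and the helper dictionary `key_e3` are made up for illustration, and it assumes each row is `[e1, e2, e3]` with unique `e1` values.

```python
from collections import OrderedDict


def cluster_by_first_member(rows, thres=0.1):
    """Group rows of [e1, e2, e3] by the e3 distance to each cluster's first member."""
    # Sort on e2 so that the lowest-e2 entry of each cluster becomes its key.
    rows = sorted(rows, key=lambda r: r[1])
    clusters = OrderedDict()  # key e1 -> list of member e1 values
    key_e3 = {}               # key e1 -> e3 of that cluster's first member
    for e1, _e2, e3 in rows:
        # Check existing clusters in creation order; join the first match.
        for key in clusters:
            if abs(e3 - key_e3[key]) < thres:
                clusters[key].append(e1)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters[e1] = [e1]
            key_e3[e1] = e3
    return clusters


lol = [['a', 3, 1.01], ['x', 5, 1.00], ['k', 7, 2.02],
       ['p', 8, 3.00], ['b', 10, 1.09], ['f', 12, 2.03]]

result = cluster_by_first_member(lol)
print(dict(result))
```

On the sample data this gives the grouping asked for in the question (`a` with `x` and `b`, `k` with `f`, and `p` alone). It uses the same strict `<` comparison as the loop above; switch to `<=` if a difference exactly equal to the threshold should count as a match.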

score 0 · Answer 2 · answered Nov 18 '14 at 08:32

Leaving e2/e3 aside, here's a rough draft.

First, a generator that groups data by value; it does need the data to be sorted by value, though.

Then an example use, first on the raw data and then with the data re-sorted by value.

In [32]: def cluster(lol, threshold=0.1):
    cl, start = None, None
    for e1, e2, e3 in lol:
        if cl and abs(start - e3) <= threshold:
            cl.append(e1)
        else:
            if cl: yield cl
            cl = [e1]
            start = e3
    if cl: yield cl

In [33]: list(cluster(lol))
Out[33]: [['a', 'x'], ['k'], ['p'], ['b'], ['f']]

In [34]: list(cluster(sorted(lol, key = lambda ar:ar[-1])))
Out[34]: [['x', 'a', 'b'], ['k', 'f'], ['p']]
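
To get from these groups to the dictionary asked for in the question, one option is to group full rows by e3 and then pick each cluster's key by lowest e2. The following is a hedged sketch along those lines, not code from the answer: `cluster_rows` and `to_dol` are hypothetical names, and because each group is measured from its lowest-e3 row rather than from its lowest-e2 member, the result may differ from the accepted answer in corner cases.

```python
def cluster_rows(rows, threshold=0.1):
    """Yield lists of full rows whose e3 lies within threshold of the group's first e3."""
    group, start = [], None
    # Grouping consecutive values only works if the rows are sorted by e3.
    for row in sorted(rows, key=lambda r: r[2]):
        if group and abs(row[2] - start) <= threshold:
            group.append(row)
        else:
            if group:
                yield group
            group, start = [row], row[2]
    if group:
        yield group


def to_dol(rows, threshold=0.1):
    """Build {leader e1: [member e1, ...]}, where the leader has the lowest e2."""
    dol = {}
    for group in cluster_rows(rows, threshold):
        group.sort(key=lambda r: r[1])  # restore e2 order inside the cluster
        dol[group[0][0]] = [r[0] for r in group]
    return dol


lol = [['a', 3, 1.01], ['x', 5, 1.00], ['k', 7, 2.02],
       ['p', 8, 3.00], ['b', 10, 1.09], ['f', 12, 2.03]]

dol = to_dol(lol)
print(dol)
```

On the sample data the clusters come out keyed by `a`, `k` and `p`, with the members of each cluster listed in e2 order.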
