First, some background: several methods for comparing clusterings rely on so-called pair counting. We have two flat clusterings, a and b, over the same n entities. In pair counting we check, for every possible pair of entities, whether the two belong to the same cluster in both clusterings, to the same cluster in a but different ones in b, the opposite, or to different clusters in both. This way we get 4 counts, let's call them n11, n10, n01, n00. These are the input for the different metrics.
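For example, two common pair-counting metrics look like this (a plain Python sketch, the function names are mine):

```python
def rand_index(n11, n10, n01, n00):
    # fraction of entity pairs on which the two clusterings agree
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard_index(n11, n10, n01, n00):
    # similar, but ignores the usually dominant n00 count
    return n11 / (n11 + n10 + n01)
```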
When the number of entities is around 10,000 and there are dozens or more clusterings to compare, performance becomes an issue: each comparison involves ~10^8 pairs, and an all-to-all comparison of the clusterings repeats this ~10^4 times.
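Concretely, with the size of my example data (the number of clusterings, 150, is only an illustrative assumption):

```python
n = 14440  # number of entities
pairs_per_comparison = n * (n - 1) // 2  # ~1.04e8 pairs

m = 150  # number of clusterings (illustrative)
comparisons = m * (m - 1) // 2  # ~1.1e4 all-to-all comparisons
```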
A naive Python implementation took practically forever, so I turned to Cython and numpy. That pushed the time of one comparison down to around 0.9-3.0 seconds, which still means half a day of runtime in my case. I am wondering if you see any possibility for a performance improvement, by some clever algorithm, a C library, or anything else.
Here are my attempts:
1) This one counts without allocating huge arrays for all the pairs; it takes the 2 membership vectors a1 and a2, both of length n, and returns the counts:
cimport cython
import numpy as np
cimport numpy as np

ctypedef np.int32_t DTYPE_t

@cython.boundscheck(False)
def pair_counts(
        np.ndarray[DTYPE_t, ndim = 1] a1,
        np.ndarray[DTYPE_t, ndim = 1] a2,
    ):
    cdef unsigned int a1s = a1.shape[0]
    cdef unsigned int a2s = a2.shape[0]
    cdef unsigned int n11, n10, n01, n00
    n11 = n10 = n01 = n00 = 0
    # type the loop variables too, otherwise they are Python
    # objects and the loop falls back to slow Python iteration
    cdef unsigned int i, j, j0
    for i in range(0, a1s - 1):
        j0 = i + 1
        for j in range(j0, a2s):
            if a1[i] == a1[j] and a2[i] == a2[j]:
                n11 += 1
            elif a1[i] == a1[j]:
                n10 += 1
            elif a2[i] == a2[j]:
                n01 += 1
            else:
                n00 += 1
    return n11, n10, n01, n00
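To validate the counts on small inputs without compiling anything, I use a brute-force pure-Python reference (a sketch, not part of the timed code; `pair_counts_ref` is my name for it):

```python
from itertools import combinations

def pair_counts_ref(a1, a2):
    # O(n^2) reference implementation, only for checking the fast
    # versions against small hand-verifiable inputs
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a1)), 2):
        same1 = a1[i] == a1[j]
        same2 = a2[i] == a2[j]
        if same1 and same2:
            n11 += 1
        elif same1:
            n10 += 1
        elif same2:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00
```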
2) This one first calculates comembership vectors (of length n * (n-1) / 2, one element per entity pair) for each of the 2 clusterings, then derives the counts from these vectors. It allocates ~20-40 MB at each comparison but, interestingly, is faster than the previous one. Note: c is a wrapper class around a clustering; it holds the usual membership vector, and also a c.members dict which contains the indices of the entities of each cluster as numpy arrays.
cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
def comembership(c):
    """
    Returns the comembership vector, where each value tells if one
    pair of entities belongs to the same cluster (1) or not (0).
    """
    cdef int n = len(c.memberships)
    cdef int cnum = c.cnum
    cdef int ri, ci, i, ii, j, li
    cdef unsigned char[:] cmem = \
        np.zeros((n * (n - 1) // 2, ), dtype = np.uint8)
    for ci in range(cnum):
        # here we use the members dict to get the indices of the
        # entities in cluster (ci), as a numpy array (mi)
        mi = c.members[ci]
        for i in range(len(mi) - 1):
            ii = mi[i]
            # convert the row index of an n x n matrix to the offset
            # in a 1 x (n * (n - 1) / 2) vector; the division must be
            # applied to the combined term, because dividing the two
            # terms separately rounds off by one for odd ii
            ri = n * ii - ii * (ii + 3) // 2 - 1
            for j in mi[i + 1:]:
                # adding j yields the index of the pair (ii, j)
                li = ri + j
                cmem[li] = 1
    return np.asarray(cmem)

def pair_counts(c1, c2):
    p1 = comembership(c1)
    p2 = comembership(c2)
    n = len(c1.memberships)
    a11 = p1 * p2
    n11 = a11.sum()
    n10 = (p1 - a11).sum()
    n01 = (p2 - a11).sum()
    # the total is the number of pairs, not the number of entities
    n00 = n * (n - 1) // 2 - n10 - n01 - n11
    return n11, n10, n01, n00
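A side note on the counting step: on uint8/boolean comembership vectors, bitwise operations with np.count_nonzero avoid the intermediate product arrays; a sketch (counts_from_comembership is my working name):

```python
import numpy as np

def counts_from_comembership(p1, p2, n):
    # p1, p2: comembership vectors of length n * (n - 1) / 2,
    # as produced by comembership() above
    b1 = p1.astype(bool)
    b2 = p2.astype(bool)
    n11 = int(np.count_nonzero(b1 & b2))
    n10 = int(np.count_nonzero(b1 & ~b2))
    n01 = int(np.count_nonzero(~b1 & b2))
    # the remaining pairs are separated in both clusterings
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00
```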
3) This is a pure numpy solution, creating an n x n boolean array of the comemberships of the entity pairs. The inputs are the membership vectors (a1 and a2).
def pair_counts(a1, a2):
    n = len(a1)
    cmem1 = a1.reshape([n, 1]) == a1.reshape([1, n])
    cmem2 = a2.reshape([n, 1]) == a2.reshape([1, n])
    # `cmem1 == cmem2` would also count the pairs separated in both
    # clusterings; only pairs together in both belong to n11
    n11 = int(((cmem1 & cmem2).sum() - n) / 2)
    n10 = int((cmem1.sum() - n) / 2) - n11
    n01 = int((cmem2.sum() - n) / 2) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00
Edit: example data
import numpy as np
a1 = np.random.randint(0, 1868, 14440, dtype = np.int32)
a2 = np.random.randint(0, 484, 14440, dtype = np.int32)
# to have the members dicts used in example 2:
def get_cnum(a):
    """
    Returns the number of clusters.
    """
    return len(np.unique(a))
def get_members(a):
    """
    Returns a dict with cluster labels as keys and the member
    entities as sorted numpy arrays.
    """
    # np.where returns the member indices already in ascending order
    return dict(
        (label, np.where(a == label)[0])
        for label in range(int(a.max()) + 1)
    )
members1 = get_members(a1)
members2 = get_members(a2)
cnum1 = get_cnum(a1)
cnum2 = get_cnum(a2)
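Edit 2: one direction worth noting (not benchmarked): the four counts can also be derived from the k1 x k2 contingency table of the two labelings in O(n + k1 * k2), without enumerating pairs at all. A numpy sketch (pair_counts_contingency is just my working name):

```python
import numpy as np

def pair_counts_contingency(a1, a2):
    # relabel both membership vectors to 0..k-1
    _, i1 = np.unique(a1, return_inverse = True)
    _, i2 = np.unique(a2, return_inverse = True)
    k1 = int(i1.max()) + 1
    k2 = int(i2.max()) + 1
    # contingency table: entity counts per cluster pair
    table = np.bincount(i1 * k2 + i2, minlength = k1 * k2).reshape(k1, k2)
    n = len(a1)
    # pairs together in both clusterings: sum of C(cell, 2)
    n11 = int((table * (table - 1) // 2).sum())
    row = table.sum(axis = 1)
    col = table.sum(axis = 0)
    # pairs together in a1 resp. a2, regardless of the other
    t1 = int((row * (row - 1) // 2).sum())
    t2 = int((col * (col - 1) // 2).sum())
    n10 = t1 - n11
    n01 = t2 - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00
```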