
First, some background: several methods for comparing clusterings rely on so-called pair counting. We have two flat clusterings, a and b, over the same n entities. In pair counting, for every possible pair of entities we check whether the two belong to the same cluster in both clusterings, to the same cluster in a but different clusters in b, the opposite, or to different clusters in both. This yields 4 counts, let's call them n11, n10, n01, n00, which are the input for various metrics.
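For concreteness, here is a brute-force reference implementation of the four counts; the name pair_counts_naive is my own, and it is far too slow for real data, but it is handy for checking the faster versions below on small inputs. For example, for a = [0, 0, 1] and b = [0, 1, 1] it returns (0, 1, 1, 1).

from itertools import combinations

def pair_counts_naive(a, b):
    # O(n^2) brute force over all unordered pairs; for validation only
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00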

When the number of entities is around 10,000 and the number of clusterings to compare is in the dozens or more, performance becomes an issue: the number of pairs is ~10^8 for each comparison, and for an all-to-all comparison of the clusterings this needs to be performed on the order of 10^4 times.

With a naive Python implementation it took practically forever, so I turned to Cython and numpy. This way I could push the time of one comparison down to around 0.9-3.0 seconds, which still means half a day of runtime in my case. I am wondering whether you see any possibility for a performance improvement, with some clever algorithm, a C library, or whatever else.

Here are my attempts:

1) This one counts without allocating huge arrays for all the pairs. It takes 2 membership vectors, a1 and a2, of length n, and returns the counts:

cimport cython
import numpy as np
cimport numpy as np

ctypedef np.int32_t DTYPE_t

@cython.boundscheck(False)
def pair_counts(
    np.ndarray[DTYPE_t, ndim = 1] a1,
    np.ndarray[DTYPE_t, ndim = 1] a2,
    ):

    cdef unsigned int a1s = a1.shape[0]
    cdef unsigned int a2s = a2.shape[0]

    cdef unsigned int n11, n10, n01, n00
    n11 = n10 = n01 = n00 = 0
    # i and j are typed as well (see DavidW's comment below):
    # untyped loop variables fall back to Python objects
    cdef unsigned int i, j, j0

    for i in range(0, a1s - 1):
        j0 = i + 1
        for j in range(j0, a2s):
            if a1[i] == a1[j] and a2[i] == a2[j]:
                n11 += 1
            elif a1[i] == a1[j]:
                n10 += 1
            elif a2[i] == a2[j]:
                n01 += 1
            else:
                n00 += 1

    return n11, n10, n01, n00

2) This one first calculates a comembership vector for each of the 2 clusterings (of length n * (n-1) / 2, one element per entity pair), then calculates the counts from these vectors. It allocates ~20-40 MB of memory at each comparison but, interestingly, it is faster than the previous one. Note: c is a wrapper class around a clustering, holding the usual membership vector, but also a c.members dict which contains the indices of the entities of each cluster as numpy arrays.
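(For reference, the condensed index of the pair (i, j), with i < j, in the flattened upper triangle of an n x n matrix is n*i - i*(i+1)/2 + (j - i - 1). The code below precomputes the j-independent part, n*i - i*(i+3)/2 - 1, once per i as ri, then adds j for each pair.)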

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
def comembership(c):
    """
    Returns comembership vector, where each value tells if one pair
    of entites belong to the same group (1) or not (0).
    """
    cdef int n = len(c.memberships)
    cdef int cnum = c.cnum
    cdef int ri, i, ii, j, li

    cdef unsigned char[:] cmem = \
        np.zeros((int(n * (n - 1) / 2), ), dtype = np.uint8)

    for ci in xrange(cnum):
        # here we use the members dict to have the indices of entities
        # in cluster (ci), as a numpy array (mi)
        mi = c.members[ci]
        for i in xrange(len(mi) - 1):
            ii = mi[i]
            # this is only to convert the indices of an n x n matrix
            # to the indices of a 1 x (n x (n-1) / 2) vector:
            ri = n * ii - 3 * ii / 2 - ii ** 2 / 2 - 1
            for j in mi[i+1:]:
                # also here, adding j only for having the correct index
                li = ri + j
                cmem[li] = 1

    return np.array(cmem)

def pair_counts(c1, c2):
    p1 = comembership(c1)
    p2 = comembership(c2)
    n = len(c1.memberships)

    a11 = p1 * p2

    n11 = a11.sum()
    n10 = (p1 - a11).sum()
    n01 = (p2 - a11).sum()
    # total number of pairs minus the rest
    n00 = n * (n - 1) // 2 - n10 - n01 - n11

    return n11, n10, n01, n00

3) This is a pure numpy solution that creates an n x n boolean comembership matrix for each of the two clusterings. The inputs are the membership vectors (a1, a2).

def pair_counts(a1, a2):

    n = len(a1)
    cmem1 = a1.reshape([n,1]) == a1.reshape([1,n])
    cmem2 = a2.reshape([n,1]) == a2.reshape([1,n])

    # note "&" rather than "==": equality would also count the pairs
    # that are in different clusters in both a1 and a2
    n11 = int((cmem1 & cmem2).sum() - n) // 2
    n10 = int(cmem1.sum() - n) // 2 - n11
    n01 = int(cmem2.sum() - n) // 2 - n11
    n00 = n - n11 - n10 - n01

    return n11, n10, n01, n00

Edit: example data

import numpy as np

a1 = np.random.randint(0, 1868, 14440, dtype = np.int32)
a2 = np.random.randint(0, 484, 14440, dtype = np.int32)

# to have the members dicts used in example 2:
def get_cnum(a):
    """
    Returns number of clusters.
    """
    return len(np.unique(a))

def get_members(a):
    """
    Returns a dict with cluster labels as keys and member entities
    as sorted numpy arrays. Assumes the labels are 0..max(a).
    """
    members = {label: [] for label in range(max(a) + 1)}
    for entity, label in enumerate(a):
        members[label].append(entity)
    return {
        label: np.array(sorted(entities), dtype = np.int64)
        for label, entities in members.items()
    }

members1 = get_members(a1)
members2 = get_members(a2)
cnum1 = get_cnum(a1)
cnum2 = get_cnum(a2)
    Could you add a representative sample case that we can play around with? – Divakar Sep 28 '16 at 08:38
  • sure, give me a little time, I will add – deeenes Sep 28 '16 at 08:39
  • probably parallelization is the way to go... and for such operations your GPU might be even faster than your CPU; that could save you a big amount of time – fr_andres Sep 28 '16 at 08:43
  • @fr_andres: this is what I want to find out: whether it is possible to improve the code, or whether it is already good and only parallelization can speed it up further. I have zero experience with using GPUs; could you maybe point me to some introductory materials? – deeenes Sep 28 '16 at 08:50
    In your first attempt you haven't typed `i` and `j`. It may manage to infer these automatically, but if not it'll cost you quite a bit. – DavidW Sep 28 '16 at 08:51
    Probably the easiest way to GPU accelerate Python is to use Numba (but I don't think I can produce an example personally) – DavidW Sep 28 '16 at 09:18
  • @DavidW: thanks for Numba, looks interesting! – deeenes Sep 28 '16 at 09:45
  • @Divakar: I added example data; I took its size from the real data – deeenes Sep 28 '16 at 09:46
  • I didn't know about Numba either, it looks great! I'll try it out with TensorFlow on your examples when I find time https://www.tensorflow.org/versions/r0.10/how_tos/using_gpu/index.html – fr_andres Sep 28 '16 at 10:04

3 Answers


The approach based on sorting has merit, but can be done a lot simpler:

def pair_counts(a, b):
    n = a.shape[0]  # also b.shape[0]

    counts_a = np.bincount(a)
    counts_b = np.bincount(b)
    sorter_a = np.argsort(a)

    n11 = 0
    same_a_offset = np.cumsum(counts_a)
    for indices in np.split(sorter_a, same_a_offset):
        b_check = b[indices]
        n11 += np.count_nonzero(b_check == b_check[:,None])

    n11 = (n11-n) // 2
    n10 = (np.sum(counts_a**2) - n) // 2 - n11
    n01 = (np.sum(counts_b**2) - n) // 2 - n11
    # total number of unordered pairs minus the rest
    n00 = n * (n - 1) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00
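(Why these formulas work: if s_i are the cluster sizes in a, then np.sum(counts_a**2) counts all ordered within-cluster pairs, including each element paired with itself, so subtracting n and halving gives the number of unordered within-cluster pairs. The same identity is wrapped up as sizes2count in the edit below.)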

If this method is coded efficiently in Cython there's another speedup (probably ~20x) to be gained.


Edit:

I found a way to completely vectorize the procedure and lower the time complexity:

def sizes2count(a, n):
    return (np.inner(a, a) - n) // 2

def pair_counts_vec_nlogn(a, b):
    # Size of "11" clusters (a[i]==a[j] & b[i]==b[j])
    ab = a * (b.max() + 1) + b  # make sure a.max() * (b.max() + 1) fits the dtype!
    _, sizes = np.unique(ab, return_counts=True)

    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(sizes, n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00

def pair_counts_vec_linear(a, b):
    # Label "11" clusters (a[i]==a[j] & b[i]==b[j])
    ab = a * (b.max() + 1) + b

    # Calculate the counts for each type of pairing
    n = len(a)  # also len(b)
    n11 = sizes2count(np.bincount(ab), n)
    n10 = sizes2count(np.bincount(a), n) - n11
    n01 = sizes2count(np.bincount(b), n) - n11
    n00 = n * (n - 1) // 2 - n11 - n10 - n01

    return n11, n10, n01, n00

Sometimes the O(n log n) algorithm is faster than the linear one, because the linear one needs O(max(a)*max(b)) storage for the bincount of ab. The naming can probably be improved, I'm not really familiar with the terminology.
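As a quick sanity check, hypothetical usage with the sizes of the example data from the question (the two variants must agree):

import numpy as np

a = np.random.randint(0, 1868, 14440)
b = np.random.randint(0, 484, 14440)

assert pair_counts_vec_nlogn(a, b) == pair_counts_vec_linear(a, b)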

  • cool, it is at least 50-100x faster than my solution was. With Cython I could not improve it much, but it is already fast enough for my purpose. I will test your 2nd version later. Thank you so much! – deeenes Sep 29 '16 at 09:16

To compare two clusterings A and B in linear time:

  1. Iterate through the clusters in A. Let the size of each cluster be a_i. The total number of pairs in the same cluster in A is the sum of all a_i*(a_i-1)/2.
  2. Partition each A-cluster according to its cluster in B. Let the size of each partition be b_j. The total number of pairs in the same cluster in both A and B is the sum of all b_j*(b_j-1)/2.
  3. The difference between the two is the total number of pairs that are in the same cluster in A but not in B.
  4. Iterate through the clusters in B to get the total number of pairs in the same cluster in B, and subtract the result of (2) to get the pairs in the same cluster in B but not in A.
  5. The sum of the above 3 results is the number of pairs that are in the same cluster in A, in B, or in both. Subtract it from n*(n-1)/2 to get the number of pairs that are in different clusters in both A and B.

The partitioning in step (2) is done by making a dictionary mapping item -> cluster for B and then looking up each item of each A-cluster. If you're cross-comparing lots of clusterings, you can save a lot of time by computing these maps just once for each clustering and keeping them around; a sketch of the whole procedure follows below.
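In Python, a minimal sketch of these five steps could look like this (the function name and the use of collections.Counter are illustrative choices, not part of the answer; grouping by the (A, B) label pair is equivalent to partitioning each A-cluster by its B-clusters):

from collections import Counter

def pair_counts_partition(a, b):
    n = len(a)

    def within(sizes):
        # number of unordered pairs inside groups of the given sizes
        return sum(s * (s - 1) // 2 for s in sizes)

    same_a = within(Counter(a).values())        # step (1)
    n11 = within(Counter(zip(a, b)).values())   # step (2)
    n10 = same_a - n11                          # step (3)
    n01 = within(Counter(b).values()) - n11     # step (4)
    n00 = n * (n - 1) // 2 - n11 - n10 - n01    # step (5)
    return n11, n10, n01, n00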

  • thanks @Matt! This is actually almost the same as @Anony-Mousse's solution, and it looks OK, although in the end I am sticking with user6758673's solution, which also brings a great improvement. – deeenes Sep 29 '16 at 09:22

You do not need to enumerate and count the pairs.

Instead, compute a confusion matrix containing the intersection sizes of each cluster from the first clustering with every cluster of the second clustering (this is one loop over all objects), then compute the number of pairs from this matrix using the equation n*(n-1)/2.

This reduces your runtime from O(n^2) to O(n), so it should give you a considerable speedup.
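A possible numpy sketch of this idea (the names are illustrative; np.add.at builds the confusion matrix in the single pass over all objects described above, assuming the cluster labels are small non-negative integers):

import numpy as np

def pair_counts_confusion(a, b):
    n = len(a)

    # confusion (contingency) matrix: intersection size of every
    # cluster of a with every cluster of b, built in one pass
    cm = np.zeros((a.max() + 1, b.max() + 1), dtype = np.int64)
    np.add.at(cm, (a, b), 1)

    def within(sizes):
        # unordered pairs inside groups of the given sizes
        return int(np.sum(sizes * (sizes - 1) // 2))

    n11 = within(cm)                      # same cluster in both
    n10 = within(cm.sum(axis = 1)) - n11  # same cluster in a only
    n01 = within(cm.sum(axis = 0)) - n11  # same cluster in b only
    n00 = n * (n - 1) // 2 - n11 - n10 - n01
    return n11, n10, n01, n00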

  • thanks, great idea! I implemented it, and it is actually slower than @user6758673's solution, but probably only because of my sloppy code for building the confusion matrix. – deeenes Sep 29 '16 at 09:18
    Use a numpy int array, and increment at `array[c1][c2]` where `c1` and `c2` are the cluster numbers of the objects. – Has QUIT--Anony-Mousse Sep 29 '16 at 18:28