
2x2 contingency matrix:

     Cj
    2  1
Ci
    1  0

Translates to:

[[ 0 0 0 1 ]
 [ 0 0 1 0 ]]

The contingency matrix represents the outcome of two clustering algorithms, Ci and Cj, each with two clusters. The row sums show that Ci puts three data points in, say, cluster 1 and one data point in cluster 2. The column sums show that Cj puts three data points in, say, cluster A and one data point in cluster B. The diagonal sums to 2, so the two algorithms "agree" on two out of N = 4 data points.
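As a quick sanity check of that reading, here is a minimal sketch that recovers those counts from the 2x2 example. The variable names are illustrative only, and treating the diagonal as "agreement" assumes cluster 1 corresponds to cluster A and cluster 2 to cluster B.

```python
# Contingency matrix from the example: rows = clusters of Ci, columns = clusters of Cj.
c = [[2, 1],
     [1, 0]]

row_sums = [sum(row) for row in c]         # cluster sizes under Ci
col_sums = [sum(col) for col in zip(*c)]   # cluster sizes under Cj
n_points = sum(row_sums)                   # total number of data points, N
# Points on the diagonal (assumes cluster k of Ci matches cluster k of Cj).
diagonal = sum(c[k][k] for k in range(min(len(c), len(c[0]))))

print(row_sums, col_sums, n_points, diagonal)
```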

Since there is no adjusted mutual information function that takes a contingency matrix as input, I would like to convert the contingency matrix into 1D inputs for the sklearn implementation of AMI.

Is there an efficient way to rewrite an n x n contingency matrix as two 1D vectors in Python?

It would look something like:

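contingency = [[2, 1], [1, 0]]
V1, V2 = [], []
for i, row in enumerate(contingency):      # i: row index
    for j, count in enumerate(row):        # j: column index
        V1 += [i] * count                  # contingency[i][j] copies of i
        V2 += [j] * count                  # contingency[i][j] copies of j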

The output should always be two vectors. Another example:

2 0 0
0 1 0
0 0 1

Would lead to two 1D vectors:

0 0 1 2
0 0 1 2
  • I have no idea what you're asking. You've posted LaTeX code there -- is that relevant to the question at all? You can't really express a 2D matrix in 1D, but of course Python supports 2D matrices. What do you expect to DO with this data? – Tim Roberts Jul 20 '22 at 18:47
  • @Tim I imagine OP tried to format their matrix. It would be better to use a markdown table, or simple text in between triple backticks. – mozway Jul 20 '22 at 18:51
  • I think some details on the logic would be helpful – mozway Jul 20 '22 at 18:59
  • If you can explain how `[[2,1],[1,0]]` becomes `[[0,0,0,1],[0,0,1,0]]`, then I'm sure we can come up with code to do it. Neither of those is 1D, of course. – Tim Roberts Jul 20 '22 at 20:32
  • @TimRoberts Indeed, LaTeX was for formatting purposes. The contingency matrix represents two clustering outcomes, each having two clusters. But I'll edit the question. – Sean_TBI_Research Jul 20 '22 at 21:53
  • Please provide a reference implementation which includes inputs and outputs. – Mad Physicist Jul 20 '22 at 22:11

2 Answers


Well, this solves the problem as you have stated it. The final nested list v can be converted to a numpy array. v needs one empty list per dimension of c, so two for a 2D matrix.


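def produce_vectors( c ):
    # One output list per dimension of c: v[0] holds row labels, v[1] column labels.
    v = [[],[]]

    for i,row in enumerate(c):
        for j,val in enumerate(row):
            # Cell (i, j) contributes val data points labelled i and j.
            v[0].extend( [i]*val )
            v[1].extend( [j]*val )
    return v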

c = [[2,1],[1,0]]
print(produce_vectors(c))
c = [[2,0,0],[0,1,0],[0,0,1]]
print(produce_vectors(c))

Output:

[[0, 0, 0, 1], [0, 0, 1, 0]]
[[0, 0, 1, 2], [0, 0, 1, 2]]
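One way to check the result is to go the other way and rebuild the contingency matrix from the two vectors. This is only a sketch: `rebuild_contingency` is a hypothetical helper name, and it assumes labels are 0-based integers, so any trailing all-zero rows or columns of the original matrix cannot be recovered.

```python
from collections import Counter

def rebuild_contingency(v1, v2):
    """Count (i, j) label pairs and lay them out as a nested-list matrix."""
    counts = Counter(zip(v1, v2))
    n_rows = max(v1) + 1   # assumes labels are 0-based integers
    n_cols = max(v2) + 1
    return [[counts[(i, j)] for j in range(n_cols)] for i in range(n_rows)]

# The two outputs shown above map back to the original inputs.
m1 = rebuild_contingency([0, 0, 0, 1], [0, 0, 1, 0])
m2 = rebuild_contingency([0, 0, 1, 2], [0, 0, 1, 2])
print(m1)
print(m2)
```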
Tim Roberts
  • The final output should ALWAYS be two vectors, even if c is larger than 2x2 or non-squared, as input for sklearn.metrics.adjusted_mutual_info_score(labels_true, labels_pred, *, average_method='arithmetic') – Sean_TBI_Research Jul 21 '22 at 07:23
  • Is there a way to make the size of v flexible depending on c? – Sean_TBI_Research Jul 21 '22 at 12:24
  • It already does that. Have you looked at the code? The two output vectors grow as required. Each will end up as long as the sum of all the values in `c`. The NUMBER of vectors in `v` depends only on the number of DIMENSIONS in `c`. Since it is 2D, there will be 2 vectors. – Tim Roberts Jul 21 '22 at 17:42
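A minimal sketch of that point, using a hypothetical 2x3 (non-square) matrix: the function still returns exactly two vectors, each as long as the sum of all counts. `produce_vectors` is repeated here only so the snippet runs on its own, and the scikit-learn call is guarded because the package may not be installed.

```python
def produce_vectors(c):
    # Same logic as the answer above, repeated so the snippet is self-contained.
    v = [[], []]
    for i, row in enumerate(c):
        for j, val in enumerate(row):
            v[0].extend([i] * val)
            v[1].extend([j] * val)
    return v

c = [[2, 1, 0],
     [0, 1, 3]]   # hypothetical non-square contingency matrix
labels_a, labels_b = produce_vectors(c)

total = sum(sum(row) for row in c)
print(len(labels_a), len(labels_b), total)   # the three numbers are equal

try:
    from sklearn.metrics import adjusted_mutual_info_score
    # The two label vectors go in as labels_true and labels_pred.
    print(adjusted_mutual_info_score(labels_a, labels_b))
except ImportError:
    pass  # scikit-learn not installed; the vectors are still valid inputs
```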

A numpy implementation could take advantage of numpy.repeat:

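import numpy as np

# input contingency matrix
a = np.array([[2,1],[1,0]])
# fixed "cluster id" matrix
b = np.array([[0,1],[0,1]])
# for this b, the column-major ravel gives the row ids and the row-major ravel the column ids
out = np.vstack([np.repeat(b.ravel('F'), a.ravel()),
                 np.repeat(b.ravel(), a.ravel())
                 ])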

Output:

array([[0, 0, 0, 1],
       [0, 0, 1, 0]])

Another example, with [[5,4],[0,3]] as input:

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])

You can also use cluster ids other than 0/1 if desired (example with a = np.array([[5,4],[0,3]]) ; b = np.array([[0,1],[2,3]])):

array([[0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 3],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3]])
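For matrices that are larger than 2x2 or non-square, a possible variant is to let numpy generate the row and column index of every cell instead of writing b by hand. This is a sketch under the assumption that the counts are non-negative integers; `contingency_to_labels` is a hypothetical name, not a numpy or sklearn function.

```python
import numpy as np

def contingency_to_labels(a):
    """Expand a 2D contingency matrix of integer counts into two label vectors."""
    a = np.asarray(a)
    rows, cols = np.indices(a.shape)   # row index and column index of every cell
    counts = a.ravel()                 # cell counts in row-major order
    return np.repeat(rows.ravel(), counts), np.repeat(cols.ravel(), counts)

v1, v2 = contingency_to_labels([[2, 1], [1, 0]])
print(v1, v2)

w1, w2 = contingency_to_labels([[2, 0, 0], [0, 1, 0], [0, 0, 1]])
print(w1, w2)
```

The two returned arrays can be passed directly as labels_true and labels_pred.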
mozway