Separating data points into clusters and taking the mean of each cluster

Question

Suppose I have some data in a 2x4 Matrix (4 data points, 2 features)

X = np.array([[4,3,5,6],    # columns: x1 x2 x3 x4
              [7,4,6,5]])

I also have a 3x4 "closest" matrix C, which indicates the closest cluster k for each data point x (3 clusters, 4 data points):

C = [[1 0 1 0]
     [0 0 0 1]
     [0 1 0 0]]

I would like to find an efficient way using numpy to compute the mean of the data points in each cluster.

My idea was to construct a matrix that would look like:

idea = [[x1 0  x3 0 ]
        [0  0  0  x4]
        [0  x2 0  0 ]]

I would then sum its elements across the columns and divide by the respective elements of np.sum(C, axis=1), since the mean should only take into account the data points that belong to that cluster (i.e. not the zeros).

The final expected output with this example should be a 3x2 matrix:

output = [(x1+x3)/2  = [ [4.5 6.5]
           x4            [6   5  ]
           x2       ]    [3   4  ]]

I wasn't even able to construct a matrix that looks like my idea matrix, and I don't know if this is the most efficient way to solve this problem.

I want to avoid using any for-loops.

Can you show us the final expected output for the given sample? Can you list a working loopy solution to it? — Divakar, Jul 01 '18 at 07:39
@Divakar Sorry, you are right. Done. I was looking for a conceptual answer rather than a practical answer. In this regard, I didn't even bother trying to implement a "loopy" solution. — Adrian Guerra, Jul 01 '18 at 07:48

Alicia Garcia-Raboso · Accepted Answer · 2018-07-01T08:49:11.300


Here's a vectorized implementation of your strategy:

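import numpy as np

X = np.array([[4, 3, 5, 6], [7, 4, 6, 5]])                # 2 features x 4 points
C = np.array([[1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]])  # 3 clusters x 4 points
# Sum the points of each cluster (X @ C.T), then divide by the cluster sizes
output = X @ C.T / np.sum(C, axis=1)

print(output)
# => [[4.5 6.  3. ]
#     [6.5 5.  4. ]]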

Since your points in X are columns, I thought it more natural to have the columns of the output be the centers of mass of the clusters. You can transpose the result if you prefer it the other way.
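To connect this with the "idea" matrix from the question, here is a minimal sketch that builds it explicitly by broadcasting. Each entry of that matrix is a 2-feature point, so it is really a 3x4x2 array. The name `idea` comes from the question; `counts` and `means` are names I made up for illustration.

```python
import numpy as np

X = np.array([[4, 3, 5, 6], [7, 4, 6, 5]])                # 2 features x 4 points
C = np.array([[1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]])  # 3 clusters x 4 points

# idea[k, n, :] is point n if it belongs to cluster k, otherwise zeros
idea = C[:, :, None] * X.T[None, :, :]       # shape (3, 4, 2)

counts = C.sum(axis=1)                       # points per cluster, shape (3,)
means = idea.sum(axis=1) / counts[:, None]   # shape (3, 2), one row per cluster

# Summing the masked points is the same as the matrix product C @ X.T
assert np.array_equal(idea.sum(axis=1), C @ X.T)
print(means)
```

The intermediate 3x4x2 array is never needed in practice, since the matrix product does the masking and the summing in one step, which is probably why the one-liner is the more efficient form.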

edited Jul 01 '18 at 08:49

answered Jul 01 '18 at 08:24


My apologies: I swapped `C` and `X` when copying my solution here. It should be fixed now. – Alicia Garcia-Raboso Jul 01 '18 at 08:49
Should be `/ np.sum(C,axis=1)[:,None]`. See https://stackoverflow.com/questions/19602187/numpy-divide-each-row-by-a-vector-element – Adrian Guerra Jul 01 '18 at 08:50

`X @ C.T / np.sum(C, axis=1)` or its transpose `C @ X.T / np.sum(C, axis=1)[:, None]`, though for the second I would prefer `C @ X.T / np.sum(C, axis=1, keepdims=True)` – Alicia Garcia-Raboso Jul 01 '18 at 08:54
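
For anyone wondering why the two forms need different divisors, here is a small sketch of the shapes involved. The variable names (`by_cols`, `by_rows`, `failed`) are only illustrative. With clusters as rows, the (3, 2) product cannot be divided by a (3,) vector, so the counts have to be a (3, 1) column.

```python
import numpy as np

X = np.array([[4, 3, 5, 6], [7, 4, 6, 5]])                # 2 features x 4 points
C = np.array([[1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]])  # 3 clusters x 4 points

# Clusters as columns: (2, 3) / (3,) broadcasts along the last axis
by_cols = X @ C.T / np.sum(C, axis=1)

# Clusters as rows: the counts must be a (3, 1) column to line up with the rows
by_rows = C @ X.T / np.sum(C, axis=1, keepdims=True)

# Without keepdims (or [:, None]) the shapes (3, 2) and (3,) do not broadcast
try:
    C @ X.T / np.sum(C, axis=1)
    failed = False
except ValueError:
    failed = True

assert failed
assert np.array_equal(by_rows, by_cols.T)
print(by_rows)
```

`by_rows` has the 3x2 layout asked for in the question.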
