Suppose I have some data in a 2x4 matrix (4 data points, 2 features):

X = np.array([[4, 3, 5, 6],    # columns are the points [x1 x2 x3 x4]
              [7, 4, 6, 5]])

I also have a 3x4 "closest" matrix C that indicates the closest cluster k for each data point x (3 clusters, 4 data points):

C = [[1 0 1 0]
     [0 0 0 1]
     [0 1 0 0]]

I would like to find an efficient way using numpy to compute the mean of the data points in each cluster.

My idea was to construct a matrix that would look like:

idea = [[x1 0  x3 0 ]
        [0  0  0  x4]
        [0  x2 0  0 ]]

I would then sum its elements across the columns and divide each row by the corresponding element of np.sum(C, axis=1), since the mean should only take into account the data points that belong to that cluster (i.e. not the zeros).

The final expected output with this example should be a 3x2 matrix:

output = [(x1+x3)/2  = [ [4.5 6.5]
           x4            [6   5  ]
           x2       ]    [3   4  ]]
  1. I wasn't even able to construct a matrix that looks like my idea matrix.
  2. I don't know whether this is the most efficient way to solve this problem.

I want to avoid using any for-loops.

1 Answer

Here's a vectorized implementation of your strategy:

import numpy as np

X = np.array([[4, 3, 5, 6], [7, 4, 6, 5]])
C = np.array([[1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]])

# Sum the points in each cluster, then divide by the cluster sizes
output = X @ C.T / np.sum(C, axis=1)

print(output)
# => [[4.5 6.  3. ]
#     [6.5 5.  4. ]]

Since your points in X are columns, I thought it more natural to have the columns of the output be the centers of mass of the clusters. You can transpose the result if you prefer it the other way around.
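If it helps to see the "idea" matrix from the question built explicitly, here is a minimal sketch using broadcasting. The names masked, counts and means are only illustrative, and it assumes every cluster contains at least one point (otherwise the division would hit a zero count). It produces the 3x2 layout asked for in the question and should agree with the transposed matrix product.

```python
import numpy as np

X = np.array([[4, 3, 5, 6],
              [7, 4, 6, 5]])      # shape (2, 4): features x points
C = np.array([[1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])      # shape (3, 4): clusters x points

# masked[k, f, n] equals X[f, n] if point n belongs to cluster k, else 0.
# This is the "idea" matrix, with one slice per cluster.
masked = C[:, None, :] * X[None, :, :]        # shape (3, 2, 4)

# Number of points per cluster (assumed non-zero for every cluster).
counts = C.sum(axis=1)                        # shape (3,)

# Sum over the points axis, then divide each cluster's row by its size.
means = masked.sum(axis=2) / counts[:, None]  # shape (3, 2)

print(means)
# => [[4.5 6.5]
#     [6.  5. ]
#     [3.  4. ]]
```

The intermediate array has shape (clusters, features, points), so for large inputs the matrix product X @ C.T is likely the more memory-friendly choice; the explicit version is mainly useful for checking the logic.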

Alicia Garcia-Raboso