Suppose I have some data in a 2x4 matrix (4 data points, 2 features):
X = np.array([[4, 3, 5, 6],    # columns are [x1 x2 x3 x4]
              [7, 4, 6, 5]])
I also have a 3x4 "closest" matrix C that indicates which cluster k each data point x is closest to (3 clusters, 4 data points):
C = [[1 0 1 0]
     [0 0 0 1]
     [0 1 0 0]]
I would like to find an efficient way using numpy to compute the mean of the data points in each cluster.
My idea was to construct a matrix that would look like:
idea = [[x1  0 x3  0]
        [ 0  0  0 x4]
        [ 0 x2  0  0]]
I would then sum its elements across the columns (i.e. along each row) and divide by the corresponding entries of np.sum(C, axis=1), since the mean should only take into account the data points that belong to that cluster (i.e. not the zeros).
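For reference, here is a minimal sketch of what I mean by the idea matrix, assuming it can be stored as a 3x2x4 array in which each "entry" x_i is a length-2 feature column. The broadcasting approach and the names `idea`, `sums`, `counts` and `means` are just my own illustration, and it assumes every cluster has at least one point (otherwise the division hits a zero count):

```python
import numpy as np

X = np.array([[4, 3, 5, 6],
              [7, 4, 6, 5]])          # shape (2, 4): features x points
C = np.array([[1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])          # shape (3, 4): clusters x points

# Broadcast (3, 1, 4) * (1, 2, 4) -> (3, 2, 4):
# idea[k, :, i] is x_i if point i belongs to cluster k, else a zero vector.
idea = C[:, None, :] * X[None, :, :]

# Sum over the points axis, then divide by the number of points per cluster.
sums = idea.sum(axis=2)               # shape (3, 2)
counts = C.sum(axis=1)                # shape (3,)
means = sums / counts[:, None]        # shape (3, 2)

print(means)
```

This materialises the whole clusters x features x points array, so memory grows with all three sizes.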
The final expected output with this example should be a 3x2 matrix:
output = [(x1+x3)/2     = [[4.5 6.5]
          x4               [6   5  ]
          x2       ]       [3   4  ]]
- I wasn't even able to construct a matrix that looks like my idea matrix.
- I don't know if this is the most efficient way to solve this problem.
I want to avoid using any for-loops.
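One loop-free possibility, as far as I can tell, is to skip the intermediate matrix entirely: since C is a 0/1 indicator, the per-cluster sums are just the matrix product of C with the transpose of X. This is only a sketch under the assumption that no cluster is empty; `sums`, `counts` and `means` are names I made up for illustration:

```python
import numpy as np

X = np.array([[4, 3, 5, 6],
              [7, 4, 6, 5]])          # shape (2, 4): features x points
C = np.array([[1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])          # shape (3, 4): clusters x points

# (3, 4) @ (4, 2) -> (3, 2): row k is the sum of the points in cluster k.
sums = C @ X.T

# Number of points per cluster, kept as a column so it broadcasts over features.
counts = C.sum(axis=1, keepdims=True)  # shape (3, 1)

means = sums / counts                  # shape (3, 2)
print(means)
```

On the example data this gives the 3x2 output listed above, but I am not sure whether it is the most efficient option.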