I'm trying to perform hierarchical clustering on a large, sparse observation matrix. The matrix holds movie ratings for a number of users, and my goal is to cluster similar users based on their movie preferences. However, I need a dendrogram rather than a single partition. To get one, I tried using SciPy:
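import numpy as np
from scipy.cluster import hierarchy
from scipy.sparse import dok_matrix
R = dok_matrix((nrows, ncols), dtype=np.float32)
for user in ratings:
    for item in ratings[user]:
        R[item, user] = ratings[user][item]
# R is items x users: transpose so each row is a user, then densify for linkage
Z = hierarchy.linkage(R.transpose().toarray(), method='ward')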
This works fine on a small data set.
However, I (obviously) run into memory problems when scaling up. Is there any way I can feed a sparse matrix to the algorithm?
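For a sense of scale, here is a rough back-of-the-envelope sketch of where the memory goes. It assumes that `linkage`, when given an observation matrix, first builds a condensed pairwise distance array of float64 values with n*(n-1)/2 entries. The helper names `dense_bytes` and `condensed_distance_bytes` and the example sizes are hypothetical, chosen only for illustration.

```python
def dense_bytes(n_users, n_items, itemsize=4):
    """Bytes needed by the dense (n_users x n_items) float32 array from toarray()."""
    return n_users * n_items * itemsize


def condensed_distance_bytes(n_users, itemsize=8):
    """Bytes for n*(n-1)/2 pairwise distances, assuming float64 storage."""
    return n_users * (n_users - 1) // 2 * itemsize


# Hypothetical sizes, not taken from my actual data.
n_users, n_items = 100_000, 20_000
print(dense_bytes(n_users, n_items) / 1e9, "GB for the dense ratings array")
print(condensed_distance_bytes(n_users) / 1e9, "GB for the pairwise distances")
```

If that assumption holds, the pairwise distances grow quadratically with the number of users, so even a sparse input would probably not remove the bottleneck on its own.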