I'm trying to perform hierarchical clustering on a large sparse observation matrix. The matrix represents movie ratings for a number of users. My goal is to cluster similar users based on their movie preferences. However, I need a dendrogram rather than a single partition. To do this, I tried to use SciPy:

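import numpy as np
from scipy.cluster import hierarchy
from scipy.sparse import dok_matrix

# Rows are items (movies), columns are users.
R = dok_matrix((nrows, ncols), dtype=np.float32)

for user in ratings:
    for item in ratings[user]:
        R[item, user] = ratings[user][item]

# toarray() builds a dense (users x items) array before clustering.
Z = hierarchy.linkage(R.transpose().toarray(), method='ward')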

This works fine on a small data set:

[image]

However, I (obviously) get memory problems when scaling up. Is there any way I can feed a sparse matrix to the algorithm?

Siegmeyer
  • Which package are you using for the clustering? It's that code and its documentation that will tell whether it can work with sparse matrices or not. Some `scikit-learn` functions work with sparse, but not all. – hpaulj Jun 20 '17 at 18:56
  • As I said, I use SciPy. Documentation of the method does not say anything about sparse matrices. Scikit does not allow producing dendrograms in a simple fashion (correct me if I'm wrong). – Siegmeyer Jun 20 '17 at 18:58
  • Are you using a 1d condensed distance matrix? – rafaelvalle Jun 20 '17 at 19:07
  • No, I use an observation matrix, not a distance matrix. – Siegmeyer Jun 20 '17 at 19:08
  • SciPy's hierarchy.linkage also accepts 1d condensed distance matrices. – rafaelvalle Jun 20 '17 at 19:10
  • I'm aware, but this is a different kind of problem. I don't have distance information, only a (samples x features) matrix. – Siegmeyer Jun 20 '17 at 19:12
  • OK, the `scipy.cluster` package. – hpaulj Jun 20 '17 at 19:16
  • If you look at hierarchy.linkage's source code, you will see that it applies scipy.spatial.distance.pdist to your observational data, which returns a condensed distance matrix. I assume you are getting memory problems when computing the distance matrix, right? – rafaelvalle Jun 20 '17 at 19:17
  • It raises "ValueError: A 2-dimensional array must be passed.", not allowing me to pass anything other than an ndarray. – Siegmeyer Jun 20 '17 at 19:22
  • Try an ndarray of shape (N, 1). – rafaelvalle Jun 20 '17 at 19:31
  • If I put all the data in a vector, how can I tell when one sample ends and another begins? – Siegmeyer Jun 20 '17 at 19:38
  • https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html – rafaelvalle Jun 20 '17 at 19:38

1 Answer

In scipy/cluster/hierarchy.py, linkage processes the y argument as follows:

y = _convert_to_double(np.asarray(y, order='c'))

if y.ndim == 1:
    distance.is_valid_y(y, throw=True, name='y')
    [y] = _copy_arrays_if_base_present([y])
elif y.ndim == 2:
    if method in _EUCLIDEAN_METHODS and metric != 'euclidean':
        raise ValueError("Method '{0}' requires the distance metric "
                         "to be Euclidean".format(method))
    y = distance.pdist(y, metric)
else:
    raise ValueError("`y` must be 1 or 2 dimensional.")

When I apply np.asarray to a dok_matrix I get a 0d object array. It just wraps the sparse matrix object in an array, so it passes neither the ndim == 1 nor the ndim == 2 test.

In [905]: M=sparse.dok_matrix([[1,0,0,2,3],[0,0,0,0,1]])
In [906]: M
Out[906]: 
<2x5 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Dictionary Of Keys format>
In [908]: m = np.asarray(M)
In [909]: m
Out[909]: 
array(<2x5 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Dictionary Of Keys format>, dtype=object)
In [910]: m.shape
Out[910]: ()

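As the code above shows, linkage accepts either a 1d condensed distance matrix or a 2d dense array of observations, which it converts to condensed distances with pdist. It has no branch for sparse input.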
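Because linkage takes a condensed distance vector directly, one possible workaround is to compute the Euclidean distances from the sparse matrix yourself and pass those in. The following is only a minimal sketch under stated assumptions: `condensed_euclidean` is a hypothetical helper (not part of SciPy), the random matrix stands in for the real ratings, and the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b is used so that the (users x items) matrix is never densified. The (users x users) distance matrix is still dense, so this only helps when the number of items, rather than the number of users, is what exhausts memory.

```python
import numpy as np
from scipy import sparse
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform


def condensed_euclidean(X):
    """Hypothetical helper: condensed Euclidean distances between rows of sparse X."""
    X = sparse.csr_matrix(X, dtype=np.float64)
    # Squared norm of every row, computed without leaving sparse format.
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    # Gram matrix of the rows; dense, but only n_samples x n_samples.
    gram = (X @ X.T).toarray()
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.maximum(d2, 0.0, out=d2)   # clip tiny negatives caused by round-off
    np.fill_diagonal(d2, 0.0)
    # Square form -> condensed (upper-triangle) vector expected by linkage.
    return squareform(np.sqrt(d2), checks=False)


# Stand-in data (assumption): 20 users, 500 movies, about 2% of ratings present.
X = sparse.random(20, 500, density=0.02, format="csr", random_state=0)

y = condensed_euclidean(X)
Z = hierarchy.linkage(y, method="ward")
```

SciPy's documentation notes that ward is only correctly defined for Euclidean distances, so when passing precomputed distances like this it is up to the caller to make sure they are Euclidean. `Z` can then be handed to `hierarchy.dendrogram` as usual.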

Looking further in linkage I deduce that ward uses nn_chain, which is in the compiled scipy/cluster/_hierarchy.cpython-35m-i386-linux-gnu.so file. That puts the working part of the method even further out of reach of the casual Python programmer.

hpaulj
  • Well, too bad. Thanks for the help. I guess I will have to switch to something else. – Siegmeyer Jun 20 '17 at 20:32
  • I wonder if the compressed sparse graph package has anything of use, https://docs.scipy.org/doc/scipy-0.19.0/reference/sparse.csgraph.html – hpaulj Jun 24 '17 at 01:09
  • Any progress on this? My current solution is iterating through individual vectors in parallel. – sdgaw erzswer Dec 18 '18 at 14:45