MemoryError from sklearn.metrics.silhouette_samples

Question

I get a memory error when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5.

For the related function, silhouette_score , this post suggests the use of the sample_size parameter which reduces the sample size before calling silhouette_samples. I am not sure that the down-sampling would still produce reliable results so I hesitate to do that.

My input, X, is a [107545 rows x 12 columns] dataframe which I would not really consider to be big, although I do only have 8gb of RAM

sklearn.metrics.silhouette_samples(X, labels, metric=’euclidean’)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])

C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
    167     check_number_of_labels(len(le.classes_), X.shape[0])
    168 
--> 169     distances = pairwise_distances(X, metric=metric, **kwds)
    170     unique_labels = le.classes_
    171     n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1245         func = partial(distance.cdist, metric=metric, **kwds)
   1246 
-> 1247     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1248 
   1249 
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1088     if n_jobs == 1:
   1089         # Special case to avoid picklability checks in delayed
-> 1090         return func(X, Y, **kwds)
   1091 
   1092     # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    244         YY = row_norms(Y, squared=True)[np.newaxis, :]
    245 
--> 246     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    247     distances *= -2
    248     distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 
MemoryError:

The calculation seems to rely on euclidean_distances which crashed on the call of np.dot. I am not dealing with scarcity here so maybe there is no solution. When calculating distance I normally use numpy.linalg.norm(A-B). Does this have better memory handling?

score 7 · Accepted Answer · 2018-10-05T21:12:26.260

Update: PR 11135 should resolve this issue within scikit-learn, making the rest of the post obsolete.

You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them. That is (1e5)**2 = 1e10 distances. Each is a floating point number; float64 format takes 8 bytes of memory. So the size of the distance matrix is 8e10 bytes, which is 74.5 GB.

This is occasionally reported on GitHub: #4701, #4197 with the answer being roughly: it's a NumPy problem that it can't handle np.dot with matrices of that size. Although there was one comment saying

it might be possible to break this up into sub-matrices to do the calculation more memory efficient.

Indeed, if instead of forming one giant distance matrix at the beginning, the method computed relevant chunks of it in the loop over labels, that would require less memory.

It is not hard to modify the method using its source so that instead of computing distances first and applying binary masks later, it masks first. This is what I did below. Instead of N**2 memory, where N is the number of samples, it requires n**2 where n is the maximal cluster size.

If this is something that looks practical, I imagine it could be added to Scikit by way of some flag... one should note that this version does not support metric='precomputed', though.

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):

        # Find inter_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :], metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples

This is not for the clustering method but for the analysis of the resultant clusters. In any case, at no time do I need all the distances in memory. For each point I need the average distance to all other points within a single cluster. At most this means 8e10 bytes = 0.000745 GB before the average is done. Then just loop over all points. Am I missing something or is it just that the coding is not handling memory in a way to prevent this? — Keith, Dec 08 '17 at 04:51
Mathematically yes (although you also need average distance to points in _another_ cluster). But the computation is vectorized as is typical in NumPy-based code. Vectorization is great for many things, but here it comes at the cost of memory use that can be seen as unacceptably high. I updated the answer. If you end up writing a memory-conservative form of this method, please post; it doesn't seem to be an active issue for scikit-learn where all relevant issues are closed. — , Dec 08 '17 at 12:22
I might submit some code to GitHub if I find a solution I am happy with. One would assume that the current method is faster but at some point the memory become an issue. This means that both a "memory safe" and "fast" method are needed. Would it be a bad idea to have a build in cut-off say 10,000 rows = 0.745 GB? Or should I just add a parameter? — Keith, Dec 08 '17 at 17:12
I just posted a memory_saving version, see if it works for you. If you write a PR, a parameter is probably best because everyone's CPU / memory balance is different, there is no threshold that works well for all. Also, the current method supports metric='precomputed' which the memory-saving version does not (so the "saving" parameter will need to be ignored if the metric is precomputed). — , Dec 08 '17 at 17:19
Using `intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)` is probably a bad idea here, since X might be integer datatype. Maybe compute `pairwise_distances(X[:100, :])` and use the dtype of that. Or just force np.float64... — , Dec 08 '17 at 17:48
Elevated to GitHub issue https://github.com/scikit-learn/scikit-learn/issues/10279 — Keith, Dec 10 '17 at 04:13

score 3 · Answer 2 · answered Feb 16 '18 at 13:07

I developed a memory-efficient and relatively fast solution for the euclidean distance cast using numba. This works with roughly constant memory relative to the size of the input data, and uses numba's automatic parallelization. With it a dataset of 300000 rows in 24 dimensions (which would have needed about 720GB of RAM). This can be modified to implement other distance metrics as needed.

from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

from numba import jit

@jit(nogil=True, parallel=True)
def euclidean_distances_numba(X, Y=None, Y_norm_squared=None):
    # disable checks
    XX_ = (X * X).sum(axis=1)
    XX = XX_.reshape((1, -1))

    if X is Y:  # shortcut in the common case euclidean_distances(X, X)
        YY = XX.T
    elif Y_norm_squared is not None:
        YY = Y_norm_squared
    else:
        YY_ = np.sum(Y * Y, axis=1)
        YY = YY_.reshape((1,-1))

    distances = np.dot(X, Y.T)
    distances *= -2
    distances += XX
    distances += YY
    distances = np.maximum(distances, 0)

    return np.sqrt(distances)

@jit(parallel=True)
def euclidean_distances_sum(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    sums = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        sums[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y, Y_norm_squared=Y_norm_squared).sum()

    return sums


@jit(parallel=True)
def euclidean_distances_mean(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    means = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        means[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y, Y_norm_squared=Y_norm_squared).mean()

    return means


def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):

        # Find inter_clust_dist for all samples belonging to the same label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_clust_dists[mask] = euclidean_distances_sum(X[mask, :]) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                other_distances = euclidean_distances_mean(X[mask, :], X[other_mask, :])
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples

score 1 · Answer 3 · answered Dec 11 '17 at 06:08

The accepted answer is much better on memory than the official function. It goes from len(data)^2 to len(cluster)^2. If you have clusters large enough then this can still pose an issue. I wrote the following, which is ~len(data) but it is terribly slow.

import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_newest(X, labels, metric='euclidean', **kwds):                                                
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    unique_labels = le.classes_
    check_number_of_labels(len(unique_labels), X.shape[0])

    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    intra_clust_dists = np.array([np.linalg.norm( X[(labels == labels[i]), :] - point, axis = 1).mean() for i, point in enumerate(X)])
    inter_clust_dists = np.array([min([np.linalg.norm( X[(labels == label), :] - point, axis = 1).mean() for label in unique_labels if label!=labels[i]]) for i, point in enumerate(X)])

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples

MemoryError from sklearn.metrics.silhouette_samples

3 Answers3

Linked