
I'm using DEC from mxnet (https://github.com/apache/incubator-mxnet/tree/master/example/deep-embedded-clustering)

While it defaults to running on MNIST, I have changed the data source to several hundred documents (which should be perfectly fine, given that MXNet can also work with the Reuters dataset).

The question: after training the model, how can I use it on new, unseen data? It gives me a different prediction each time!

Here is the code for collecting the dataset:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(dtype=np.float64, stop_words='english', max_features=2000,
                             norm='l2', sublinear_tf=True).fit(training)

X = vectorizer.transform(training)
X = np.asarray(X.todense())  # * np.sqrt(X.shape[1])

Y = np.asarray(labels)

Here is the code for prediction:

def predict(self, TrainX, X, update_interval=None):
    N = TrainX.shape[0]
    if not update_interval:
        update_interval = N
    batch_size = 256

    # Re-extract the embedding for the training data and re-fit k-means.
    # Note: KMeans is re-initialized on every call, so the cluster centers
    # (and label ids) can change between runs.
    test_iter = mx.io.NDArrayIter({'data': TrainX}, batch_size=batch_size, shuffle=False,
                                  last_batch_handle='pad')
    args = {k: mx.nd.array(v.asnumpy(), ctx=self.xpu) for k, v in self.args.items()}
    z = list(model.extract_feature(self.feature, args, None, test_iter, N, self.xpu).values())[0]
    kmeans = KMeans(self.num_centers, n_init=20)
    kmeans.fit(z)

    args['dec_mu'][:] = kmeans.cluster_centers_

    # Embed the new samples and assign them with the DEC soft-assignment op.
    sample_iter = mx.io.NDArrayIter({'data': X}, batch_size=batch_size, shuffle=False,
                                    last_batch_handle='pad')
    z = list(model.extract_feature(self.feature, args, None, sample_iter, X.shape[0], self.xpu).values())[0]
    p = np.zeros((z.shape[0], self.num_centers))
    self.dec_op.forward([z, args['dec_mu'].asnumpy()], [p])
    y_pred = p.argmax(axis=1)

    self.y_pred = y_pred
    return y_pred

Explanation: I thought I also needed to pass a sample of the data I trained the system with, which is why you see both TrainX and X there.

Any help is greatly appreciated.

David Niki

1 Answer


Clustering methods (by themselves) don't provide a method for labelling samples that weren't included in the calculation for deriving the clusters. You could re-run the clustering algorithm with the new samples, but the clusters are likely to change and be given different cluster labels due to different random initializations. So this is probably why you're seeing different predictions each time.
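This label-permutation effect is easy to reproduce with scikit-learn's KMeans alone (a toy sketch, independent of DEC/MXNet): two runs with different random initializations can find the exact same partition of the data yet hand out swapped integer label ids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs in 2D.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)),
               rng.normal(5, 0.1, (50, 2))])

# Two runs with different seeds; both recover the same two blobs.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(X)

# The partitions agree, but the integer labels may be 0/1-swapped
# between runs, so raw label comparisons across calls are meaningless.
same_partition = (labels_a == labels_b).all() or (labels_a == 1 - labels_b).all()
print(same_partition)  # True
```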

One option is to use the cluster labels from the clustering method in a supervised way, to predict the cluster labels for new samples. You could find the closest cluster center to your new sample (in the feature space) and use that as the cluster label, but this ignores the shape of the clusters. A better solution would be to train a classification model to predict the cluster labels for new samples given the previously clustered data. Success of these methods will depend on the quality of your clustering (e.g. the feature space used, the separability of the clusters, etc.).
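Both options can be sketched with scikit-learn (the toy embeddings here are hypothetical stand-ins for the DEC feature space): fit KMeans once on the training embeddings and keep it, then either reuse `kmeans.predict` for nearest-centroid assignment, or train a classifier on the resulting cluster labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical embedded training features (stand-in for DEC's z vectors).
rng = np.random.RandomState(0)
Z_train = np.vstack([rng.normal(0, 0.3, (100, 10)),
                     rng.normal(3, 0.3, (100, 10))])

# 1) Cluster the training embeddings ONCE and keep the fitted model,
#    instead of re-fitting k-means on every prediction call.
kmeans = KMeans(n_clusters=2, n_init=20, random_state=0).fit(Z_train)

# New, unseen samples (embedded into the same feature space).
Z_new = rng.normal(0, 0.3, (5, 10))

# Option A: nearest-centroid assignment (ignores cluster shape).
pred_centroid = kmeans.predict(Z_new)

# Option B: supervised classifier trained on the cluster labels,
# which can capture more of the clusters' structure.
clf = LogisticRegression(max_iter=1000).fit(Z_train, kmeans.labels_)
pred_clf = clf.predict(Z_new)

print(pred_centroid, pred_clf)
```

Because the fitted `kmeans` object (and hence the label ids) is reused rather than re-derived, repeated calls now give consistent predictions.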

Thom Lane