I'm using DEC from mxnet (https://github.com/apache/incubator-mxnet/tree/master/example/deep-embedded-clustering)
While it defaults to run on the MNIST, I have changed the datasource to several hundreds of documents (which should be perfectly fine, given that mxnet can work with the Reuters dataset)
The question; after training MXNET, how can I use it on new, unseen data? It shows me a new prediction each time!
Here is the code for collecting the dataset:
vectorizer = TfidfVectorizer(dtype=np.float64, stop_words='english', max_features=2000, norm='l2', sublinear_tf=True).fit(training)
X = vectorizer.transform(training)
X = np.asarray(X.todense()) # * np.sqrt(X.shape[1])
Y = np.asarray(labels)
Here is the code for prediction:
def predict(self, TrainX, X, update_interval=None):
N = TrainX.shape[0]
if not update_interval:
update_interval = N
batch_size = 256
test_iter = mx.io.NDArrayIter({'data': TrainX}, batch_size=batch_size, shuffle=False,
last_batch_handle='pad')
args = {k: mx.nd.array(v.asnumpy(), ctx=self.xpu) for k, v in self.args.items()}
z = list(model.extract_feature(self.feature, args, None, test_iter, N, self.xpu).values())[0]
kmeans = KMeans(self.num_centers, n_init=20)
kmeans.fit(z)
args['dec_mu'][:] = kmeans.cluster_centers_
print(args)
sample_iter = mx.io.NDArrayIter({'data': X})
z = list(model.extract_feature(self.feature, args, None, sample_iter, N, self.xpu).values())[0]
p = np.zeros((z.shape[0], self.num_centers))
self.dec_op.forward([z, args['dec_mu'].asnumpy()], [p])
print(p)
y_pred = p.argmax(axis=1)
self.y_pred = y_pred
return y_pred
Explanation: I thought I also need to pass a sample of the data I trained the system with. That is why you see both TrainX and X there.
Any help is greatly appreciated.