
I have a very large dataset and need to reduce embeddings from 768 dimensions to 128 dimensions with t-SNE. Since I have more than 1 million rows, it takes more than a week to complete the dimensionality reduction on the whole dataset, so I thought maybe I could split the dataset into parts and then process each part separately. I do not have a GPU, only a CPU.

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=128, init='pca', random_state=1001, perplexity=30, method='exact', n_iter=250, verbose=1)
X_tsne = tsne.fit_transform(df_dataset[:1000000])  # this will either fail or take a while (most likely overnight)
```
I am wondering whether my approach is considered OK?

The above does not use splitting yet; it just loads the whole dataset. I just want to confirm whether splitting into multiple batches and then running fit_transform on each batch is the right way or not.

Also, I checked the link below about whitening sentence representations, but I am not sure whether it would work with my approach above if I replaced t-SNE with whitening. https://deep-ch.medium.com/dimension-reduction-by-whitening-bert-roberta-5e103093f782
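For context, here is a minimal NumPy sketch of how I understand the whitening-based reduction in that post: estimate the mean and covariance of the embeddings, whiten using the SVD of the covariance, and keep only the first 128 whitened dimensions. The function name and the `df_dataset` conversion are just illustrative, not from the post itself.

```python
import numpy as np

def whitening_reduce(X, k=128):
    """Whiten 768-dim embeddings X and keep the top-k whitened dimensions."""
    mu = X.mean(axis=0, keepdims=True)     # (1, 768) mean vector
    cov = np.cov((X - mu).T)               # (768, 768) covariance matrix
    U, S, _ = np.linalg.svd(cov)           # SVD of the covariance
    W = U / np.sqrt(S + 1e-12)             # whitening matrix; column j scaled by 1/sqrt(S[j])
    return (X - mu) @ W[:, :k]             # (n_samples, k)

# X = df_dataset.to_numpy()       # assuming df_dataset holds the raw 768-dim embeddings
# X_reduced = whitening_reduce(X, k=128)
```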

  • If you are aware of the `split` command available in `Unix`, make use of that to split into batches of `100`. Make an embedded `Unix + Python` script. Link: https://kb.iu.edu/d/afar – codeholic24 Oct 05 '22 at 08:53
  • Yes, I know how to split it using Python, but I am wondering whether splitting it and then running fit_transform on each batch is the right way to do it. – just want to learn Oct 05 '22 at 09:18
  • Not a *programming* question, hence off-topic here; please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Oct 05 '22 at 10:55

1 Answer


It probably depends on what you're trying to do, but I suspect the answer is that it is the wrong thing to do.

Between different batches it would be difficult to guarantee that the reduced-dimension representations are comparable, since each batch would have been optimised independently, not on the same data. So you could end up with points that look similar in the low-dimensional representation even though they aren't similar in the original representation.

It seems like PCA might be better suited to you, since it's very fast. Or UMAP, since it is also fast but additionally has some ways to work with batched data.
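If PCA is acceptable, one batched variant is scikit-learn's IncrementalPCA, which learns a single projection in batches on CPU and then applies that same projection everywhere, so the reduced representations stay comparable. A minimal sketch (the batch size and the `df_dataset` name are taken from your question and may need adjusting):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

batch_size = 10_000                     # each batch must have >= n_components rows for partial_fit
ipca = IncrementalPCA(n_components=128, batch_size=batch_size)

# First pass: learn one shared 768 -> 128 projection, batch by batch.
for start in range(0, len(df_dataset), batch_size):
    ipca.partial_fit(np.asarray(df_dataset[start:start + batch_size]))

# Second pass: apply the *same* fitted projection to every batch,
# so the 128-dim representations are comparable across the whole dataset.
X_128 = np.vstack([
    ipca.transform(np.asarray(df_dataset[start:start + batch_size]))
    for start in range(0, len(df_dataset), batch_size)
])
```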

Téo
  • Thank you. Yes, I tried UMAP too, but then I faced another issue. I posted it here: https://stackoverflow.com/questions/73970140/kernel-dead-when-training-umap-with-large-dataset-with-high-dimension – just want to learn Oct 06 '22 at 09:58
  • It seems like you've deleted your other question, but given the info in the URL, my guess is that you hit a memory error. UMAP has an argument for specifying low memory; maybe give that a shot? Or try the method outlined in the [alignedUMAP](https://umap-learn.readthedocs.io/en/latest/aligned_umap_basic_usage.html) examples. – Téo Oct 06 '22 at 17:35
  • Thank you, actually I found the answer to it. It seems like I can use `update` as an alternative to make UMAP incremental (a rough sketch of this idea is below the comments). – just want to learn Oct 07 '22 at 13:05
  • Hi, since I am facing the same problem, could you elaborate on what you finally did? – David Harar Jul 21 '23 at 07:37
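A rough sketch of the incremental-UMAP idea from the comments above, assuming umap-learn's `update()` method behaves as described in its documentation (fit on an initial chunk, then feed in the remaining chunks). The chunking, `low_memory` setting, and variable names are illustrative and not tested on data of this size:

```python
import numpy as np
import umap

# Split the full embedding matrix into ~100 chunks (roughly 10k rows each).
chunks = np.array_split(np.asarray(df_dataset), 100)

# 128 components as in the question; UMAP is more commonly used for far fewer dimensions.
reducer = umap.UMAP(n_components=128, low_memory=True, random_state=1001)
reducer.fit(chunks[0])            # initial fit on the first chunk

for chunk in chunks[1:]:
    reducer.update(chunk)         # add the remaining chunks incrementally

X_umap = reducer.embedding_       # embedding of all rows seen so far
```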