I have a very large dataset and need to reduce 768-dimensional embeddings to 128 dimensions with t-SNE. Since I have more than 1 million rows, running the dimensionality reduction on the whole dataset would take weeks, so I thought I could split the dataset into several parts and process each part separately. I do not have a GPU, only a CPU.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=128, init='pca', random_state=1001, perplexity=30, method='exact', n_iter=250, verbose=1)
X_tsne = tsne.fit_transform(df_dataset[:1000000]) # this will either fail or take a while (most likely overnight)
I am wondering whether my approach is considered OK?
The code above does not use splitting yet; it just loads the whole dataset. I want to confirm whether splitting the data into multiple batches and calling fit_transform on each batch separately is the right way to do this (see the sketch below for what I mean).
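To make the idea concrete, this is roughly what I have in mind (just a sketch; batch_size is an arbitrary number I picked, and df_dataset is my DataFrame of 768-dimensional embeddings):

import numpy as np
from sklearn.manifold import TSNE

batch_size = 100_000  # arbitrary chunk size, small enough to finish in reasonable time
embeddings = df_dataset.to_numpy()  # shape (n_rows, 768)

reduced_parts = []
for start in range(0, len(embeddings), batch_size):
    batch = embeddings[start:start + batch_size]
    tsne = TSNE(n_components=128, init='pca', random_state=1001,
                perplexity=30, method='exact', n_iter=250, verbose=1)
    reduced_parts.append(tsne.fit_transform(batch))  # (batch_size, 128) per batch

X_reduced = np.vstack(reduced_parts)  # stacked back to (n_rows, 128)

Each batch would get its own independent t-SNE fit, which is exactly the part I am unsure about.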
Also, I came across the link below about whitening sentence representations, but I am not sure whether it would work with my approach above if I replaced t-SNE with whitening: https://deep-ch.medium.com/dimension-reduction-by-whitening-bert-roberta-5e103093f782
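If I understand the post correctly, the whitening step would look roughly like this (my own sketch based on my reading of the post, not code copied from it; k=128 is the target dimension):

import numpy as np

def whitening_reduce(embeddings: np.ndarray, k: int = 128) -> np.ndarray:
    # Sketch: subtract the mean, decompose the covariance with SVD, and keep
    # the first k whitened components (this is my reading of the linked post).
    mu = embeddings.mean(axis=0, keepdims=True)   # (1, 768)
    cov = np.cov((embeddings - mu).T)             # (768, 768) covariance
    u, s, _ = np.linalg.svd(cov)                  # cov = u @ diag(s) @ u.T
    w = u @ np.diag(1.0 / np.sqrt(s))             # whitening matrix (768, 768)
    return (embeddings - mu) @ w[:, :k]           # reduced to (n_rows, k)

# e.g. X_white = whitening_reduce(df_dataset.to_numpy(), k=128)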