We developed a Jupyter Notebook on a local machine to train models with the Python 3 libraries sklearn and gensim.
As we set the random_state variable to a fixed integer, the results were always the same.
After this, we tried moving the notebook to a workspace in Azure Machine Learning Studio (classic), but the results differ even though we keep random_state the same.
As suggested in the following links, we installed the same library versions and checked that the MKL version was the same and that the MKL_CBWR variable was set to AUTO.
t-SNE generates different results on different machines
Same Python code, same data, different results on different machines
Still, we are not able to get the same results.
What else should we check or why is this happening?
Update
If we generate a pkl file on the local machine and import it in AML, the results are the same (as is the intention of the pkl file).
Still, we are looking to get the same results (if possible) without importing the pkl file.
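For reference, the pkl round trip described above could look roughly like the sketch below. It is only an illustration under assumptions: the data is random placeholder data rather than our mat_quests, the file name pca.pkl is hypothetical, and both steps run in one process here, whereas in our case the dump happens locally and the load happens in AML.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data (assumption): stands in for the real question matrix.
rng = np.random.RandomState(0)
X = rng.rand(20, 5)

# Fit once; this is the step that would run on the local machine.
pca = PCA(n_components=3, random_state=42).fit(X)
expected = pca.transform(X)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "pca.pkl")
with open(path, "wb") as fh:
    pickle.dump(pca, fh)

# This is the step that would run in the other environment.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model reuses the stored components, so nothing is refitted.
same = np.allclose(restored.transform(X), expected)
print(same)
```

The point of the sketch is that loading the pkl file skips the fitting step entirely, which is why it hides whatever difference appears when fitting is repeated in the second environment.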
Library versions
gensim 3.8.3.
sklearn 0.19.2.
matplotlib 2.2.3.
numpy 1.17.2.
scipy 1.1.0.
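To compare the two environments side by side, a small report like the following can be run in both notebooks and the outputs diffed. This is a minimal sketch: the function name environment_report and the placeholder strings are our own choices for illustration, and it only reads each module's __version__ attribute and the MKL_CBWR environment variable, so it does not show which BLAS/MKL build is actually linked.

```python
import importlib
import os
import platform
import sys

# Packages listed in this question; adjust as needed.
PACKAGES = ["gensim", "sklearn", "matplotlib", "numpy", "scipy"]


def environment_report(packages=PACKAGES):
    """Collect interpreter, platform, MKL_CBWR and package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "MKL_CBWR": os.environ.get("MKL_CBWR", "<unset>"),
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # missing or broken install
            report[name] = "<not available: {}>".format(type(exc).__name__)
        else:
            report[name] = getattr(module, "__version__", "<no __version__>")
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("{}: {}".format(key, value))
```

Running this locally and in AML gives two short listings that can be compared line by line; a difference in the platform or python line would suggest looking beyond the package versions.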
Code
Full code can be found here, sample data link inside.
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from gensim.models import KeyedVectors
%matplotlib inline
import time
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns
wordvectors_file_vec = '../libraries/embeddings-new_large-general_3B_fasttext.vec'
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec)
mat_quests = ...  # placeholder: some transformations using wordvectors
df_subset = pd.DataFrame()
pca = PCA(n_components=3, random_state = 42)
pca_result = pca.fit_transform(mat_quests)
df_subset['pca-one'] = pca_result[:,0]
df_subset['pca-two'] = pca_result[:,1]
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300, random_state = 42)
tsne_results = tsne.fit_transform(mat_quests)
df_subset['tsne-2d-one'] = tsne_results[:,0]
df_subset['tsne-2d-two'] = tsne_results[:,1]
pca_50 = PCA(n_components=50, random_state = 42)
pca_result_50 = pca_50.fit_transform(mat_quests)
print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))
time_start = time.time()
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300, random_state = 42)
tsne_pca_results = tsne.fit_transform(pca_result_50)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))