Standard scaler produces different values before PCA

Question

I am working on a classification problem in biometrics. I compare each probe in the testing set with the gallery using the Euclidean distance.

Every time I run the code I get different results. If I remove the scaler, I always get the same results.

Why does the scaler produce different values? (The difference is slight: sometimes it recognizes 10 more probes, sometimes 10 fewer.) Thanks to all who answer.

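from numpy import load
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to the testing set
scaler = StandardScaler()
training_walks_matrix = load('training_imputeZero.npy')
training_scaled = scaler.fit_transform(training_walks_matrix)
testing_walks_matrix = load('testing_imputeZero.npy')
testing_scaled = scaler.transform(testing_walks_matrix)

# Fit PCA on the scaled training set and project both sets
pca = PCA(n_components=50).fit(training_scaled)
training_walks_matrix = pca.transform(training_scaled)
testing_walks_matrix = pca.transform(testing_scaled)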

Is the input matrix that you scale always **exactly** the same? — petezurich, Apr 15 '21 at 16:54
@petezurich yes it's the same, it's a file saved with numpy. — , Apr 15 '21 at 17:20
Hmmm... Can you provide a [minimal example](https://stackoverflow.com/help/minimal-reproducible-example) to reproduce the error? — petezurich, Apr 15 '21 at 18:56
@petezurich the problem now is fixed, you can see the answer if you are curious about it. — , Apr 15 '21 at 20:44

score 1 · Accepted Answer · answered Apr 15 '21 at 19:01

My only suspicion is that the arpack or randomized SVD solver is being used behind the scenes in your case, since PCA picks the solver automatically by default (svd_solver='auto'). Both of those solvers involve randomness, so the scaler is probably not the cause; you need to fix the random seed in order to reproduce the results.

Fix the random seed by passing a value to the random_state argument of the PCA instance.

from numpy import load
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

myseed = 0

scaler = StandardScaler()
training_walks_matrix = load('training_imputeZero.npy')
training_scaled = scaler.fit_transform(training_walks_matrix)
testing_walks_matrix = load('testing_imputeZero.npy')
testing_scaled = scaler.transform(testing_walks_matrix)

# Only change: pass random_state so the SVD solver starts from the same seed on every run
pca = PCA(n_components=50, random_state=myseed).fit(training_scaled)

training_walks_matrix = pca.transform(training_scaled)
testing_walks_matrix = pca.transform(testing_scaled)
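
To check where the variation comes from, a minimal sketch along these lines may help. It uses a synthetic matrix as a stand-in for the .npy files (the shape and the helper name project are made up for illustration) and forces svd_solver="randomized" so that the effect of the seed is visible regardless of what the automatic choice would be.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training matrix (hypothetical shape, not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 300))

# StandardScaler only computes per-column means and standard deviations,
# so fitting it twice on the same input gives the same output.
X_scaled = StandardScaler().fit_transform(X)
X_scaled_again = StandardScaler().fit_transform(X)
scaler_is_deterministic = np.allclose(X_scaled, X_scaled_again)


def project(random_state):
    # Hypothetical helper: fit PCA with the randomized solver and project the data.
    pca = PCA(n_components=50, svd_solver="randomized", random_state=random_state)
    return pca.fit_transform(X_scaled)


# Same seed on both runs: the projections should match.
seeded_a = project(random_state=0)
seeded_b = project(random_state=0)
seeded_runs_match = np.allclose(seeded_a, seeded_b)

# No seed: each run draws different random numbers, so the projections may differ.
unseeded_a = project(random_state=None)
unseeded_b = project(random_state=None)
unseeded_max_diff = np.abs(unseeded_a - unseeded_b).max()

print("scaler deterministic:", scaler_is_deterministic)
print("seeded runs match:", seeded_runs_match)
print("max difference between unseeded runs:", unseeded_max_diff)
```

If the scaler check passes and only the unseeded PCA runs differ, the scaler is not the source of the variation. Another option, if run time allows, is svd_solver="full", which computes an exact SVD and does not rely on random_state.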

Thank you. Now all the tests I run give the same results. I see the random_state parameter in the docs, but I don't understand what it is. Can you explain it simply? — , Apr 15 '21 at 20:42
Internally, the code uses random initializations. If you do not set the random seed, a different seed is used each time you run it and you get different results. To get the same results, you need to set the random seed. — seralouk, Apr 15 '21 at 20:55
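
As a rough illustration of what a seed does, here is a small sketch using NumPy's own random generator. It is an analogy for the behaviour of random_state, not a description of scikit-learn's internals.

```python
import numpy as np

# Two generators created with the same seed produce the same "random" numbers.
first = np.random.default_rng(0).normal(size=3)
second = np.random.default_rng(0).normal(size=3)
same_seed_matches = np.array_equal(first, second)

# A generator created with a different seed produces a different sequence.
other = np.random.default_rng(1).normal(size=3)
different_seed_differs = not np.array_equal(first, other)

print("same seed, same numbers:", same_seed_matches)
print("different seed, different numbers:", different_seed_differs)
```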
