I'm working on a clustering problem. To make the results reproducible, we set the `random_state` parameter of `KMeans()` to 0. However, after updating scikit-learn from version 0.22.2 to version 1.2.2, I ran into an unexpected issue: when I ran the same code on the same dataset, the results differed from our previous run. We are unsure what causes this inconsistency and have been unable to reproduce the initial result.

Code:

    from sklearn.cluster import KMeans

    model = KMeans(n_clusters=5, init='k-means++', tol=0.0001, random_state=0, copy_x=True, algorithm='auto')

Expected results (number of clusters = 5):

    Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
           10 |        20 |        12 |        30 |        45

Actual results

Version 0.22.2 (number of clusters = 5):

    Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
           10 |         5 |         6 |        14 |         5

Version 1.2.2 (number of clusters = 5):

    Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
            3 |         7 |        20 |         8 |         2
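
For comparing runs across installations, a minimal sketch along these lines could print the installed version next to the cluster sizes. The `make_blobs` data is a synthetic stand-in (an assumption, since the real dataset is not shown here), and `n_init=10` is written out explicitly rather than left to the library default. Sizes are sorted because the numbering of cluster labels is arbitrary.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: the real dataset is not available here).
X, _ = make_blobs(n_samples=40, centers=5, random_state=0)

# Write the settings out explicitly instead of relying on defaults.
model = KMeans(n_clusters=5, init="k-means++", n_init=10, tol=1e-4, random_state=0)
labels = model.fit_predict(X)

# Count points per cluster; sort so that arbitrary label numbering does not matter.
sizes = np.sort(np.bincount(labels, minlength=5))
print("scikit-learn", sklearn.__version__, "cluster sizes:", sizes.tolist())
```

Running the same script under each version gives one line per environment that can be compared directly.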
desertnaut
Okorimi Manoury
  • I was checking (yes, one by one) older versions' changelogs and came across [this](https://scikit-learn.org/stable/whats_new/v0.22.html#changed-models). Also [this](https://scikit-learn.org/stable/whats_new/v0.23.html#changed-models). – doneforaiur Jul 30 '23 at 19:15
  • The source code differs drastically between these versions. Slight changes in the initialization or the algorithm can lead to different results. If it's important for you to find the cause of these changes, you can check the scikit-learn source code for these versions yourself. – Andrey Jul 30 '23 at 19:16
  • Thank you for your response. I am using init='k-means++' for initialization. Could this difference in initialization be attributed to the change between different versions of the library? – Okorimi Manoury Jul 30 '23 at 19:23
  • Weren't you using `k-means++` for the earlier version too? – doneforaiur Jul 30 '23 at 19:30
  • Yes, I was. I use `k-means++`. – Okorimi Manoury Jul 30 '23 at 19:42
  • There is absolutely **no** reason why algorithms that include randomization should give identical results between different versions of any library, even with the same random seed. – desertnaut Jul 30 '23 at 22:45
  • To illustrate the sort of thing that @desertnaut is talking about, suppose that, in the new version of `kmeans`, they were to make a single extra call to `np.random.randint` for whatever reason. Even assuming the same RNG and the same seed, this would shift the entire sequence of random numbers, resulting in what would appear for all intents and purposes to be a totally different model initialization. – Him Jul 30 '23 at 22:54
  • As @Him says; see what happens to a random forest when you include such an extra call to the random number generator (it is in R, but the lesson is the same): https://stackoverflow.com/questions/63224935/why-does-the-importance-parameter-influence-performance-of-random-forest-in-r/63468381#63468381 – desertnaut Jul 30 '23 at 22:57
  • I might be wrong but there's no such change between `1.2.2` and `0.22.2`. See [this](https://github.com/scikit-learn/scikit-learn/compare/1.2.2...0.22.2.post1). Maybe some other underlying mechanism? – doneforaiur Jul 31 '23 at 04:41
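
The extra-call scenario described in the comments can be sketched with NumPy alone. This is only an illustration, not scikit-learn's actual code: `draw_indices` and its `extra_call` flag are hypothetical names, and the extra `randint` call stands in for whatever additional draw a newer release might make.

```python
import numpy as np

def draw_indices(extra_call):
    """Draw five 'initial centre' indices from a generator seeded with 0."""
    rng = np.random.RandomState(0)  # same seed on every call
    if extra_call:
        rng.randint(10)  # hypothetical extra draw; its result is discarded
    return rng.randint(0, 1000, size=5)

baseline = draw_indices(extra_call=False)
shifted = draw_indices(extra_call=True)

print("baseline:", baseline)
print("shifted: ", shifted)
print("repeat run matches baseline:", np.array_equal(baseline, draw_indices(False)))
print("extra call matches baseline:", np.array_equal(baseline, shifted))
```

With an identical seed, repeating the same sequence of calls reproduces the same numbers, while one additional draw is enough to change everything that follows it.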

0 Answers