
Is

class sklearn.cross_validation.ShuffleSplit(
    n, 
    n_iterations=10, 
    test_fraction=0.1, 
    indices=True, 
    random_state=None
)

the right way to do 10*10-fold CV in scikit-learn? (By varying random_state over 10 different values.)

I ask because I didn't find any random_state parameter in StratifiedKFold or KFold, and the splits produced by KFold are always identical for the same data.

If ShuffleSplit is the right choice, one concern is that the documentation mentions:

Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets

Is this always the case for 10*10 fold CV?
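The quoted caveat can be illustrated with a toy, stdlib-only sketch of ShuffleSplit's splitting logic (the function `shuffle_split` below is made up for illustration, not scikit-learn API): each iteration draws an independent random permutation, so nothing *guarantees* distinct test sets, but with a sizeable dataset a collision is astronomically unlikely.

```python
import random

def shuffle_split(n, n_iterations, test_fraction, seed):
    """Toy sketch of ShuffleSplit: each iteration independently
    shuffles the indices and holds out the first test_fraction
    of them, so two iterations could in principle coincide."""
    rng = random.Random(seed)
    n_test = int(n * test_fraction)
    splits = []
    for _ in range(n_iterations):
        perm = list(range(n))
        rng.shuffle(perm)
        test = frozenset(perm[:n_test])   # held-out 10%
        train = perm[n_test:]             # remaining 90%
        splits.append((train, test))
    return splits

splits = shuffle_split(n=100, n_iterations=10, test_fraction=0.1, seed=0)
test_sets = [test for _, test in splits]
# With 100 samples there are C(100, 10) possible test sets, so 10
# independent draws are essentially certain to all be distinct.
print(len(set(test_sets)))
```

With even 100 samples, the chance of two of the 10 draws picking the same 10-element test set is negligible, which is what the docs mean by "very likely for sizeable datasets".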

Fabian N.
Flake

1 Answer


I am not sure what you mean by 10*10 cross-validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. You can either call this 10 times explicitly with an outer loop, or perform all 100 splits directly in a single loop, with 10% of the data reserved for testing each time, by using instead:

>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
...     random_state=42)

If you want to do 10 runs of StratifiedKFold with k=10, you can shuffle the dataset between the runs (that would lead to a total of 100 calls to the fit method, with a 90% train / 10% test split for each call to fit):

>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
...    X, y = shuffle(X_orig, y_orig, random_state=i)
...    skf = StratifiedKFold(y, 10)
...    print(cross_val_score(clf, X, y, cv=skf))
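The bookkeeping behind this shuffle-then-KFold loop can be sketched in plain Python without scikit-learn; `repeated_kfold_indices` below is an illustrative helper, not a library function, and unlike StratifiedKFold it does not balance class proportions across folds:

```python
import random

def repeated_kfold_indices(n, k=10, n_repeats=10):
    """Sketch of repeated k-fold: for each repeat, shuffle the
    indices with a different seed, then cut them into k contiguous
    folds, yielding k * n_repeats train/test index splits."""
    for repeat in range(n_repeats):
        idx = list(range(n))
        random.Random(repeat).shuffle(idx)  # fresh permutation per run
        fold_size = n // k
        for f in range(k):
            test = idx[f * fold_size:(f + 1) * fold_size]
            train = idx[:f * fold_size] + idx[(f + 1) * fold_size:]
            yield train, test

splits = list(repeated_kfold_indices(100))
print(len(splits))  # 100 fit calls, each with a 90/10 split
```

Within each repeat the 10 test folds are disjoint and cover every sample exactly once, which is the property that plain ShuffleSplit does not guarantee.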
Tom
ogrisel
  • Thanks, it's exactly what I was looking for. BTW, I saw 42 many times in examples on the web page, any story for that? – Flake Nov 26 '11 at 20:11
  • You are asking the wrong question :) http://en.wikipedia.org/wiki/42_(Hitchhiker%27s_Guide_to_the_Galaxy)#Answer_to_the_Ultimate_Question_of_Life.2C_the_Universe.2C_and_Everything_.2842.29 – ogrisel Nov 26 '11 at 21:55
  • More seriously, in the examples and tests we want reproducible outcomes, hence we fix the PRNG seed to an arbitrary value. Feel free to tweak the value; the outcome should still "look good" but may be slightly different (some algorithms have non-convex objective functions with several good local optima). – ogrisel Nov 27 '11 at 14:56
  • @ogrisel Hi. If I use a StratifiedShuffleSplit, do I still need the outer loop? I want to do a 10x10 SSS inside a Pipeline. – Aizzaac Nov 06 '16 at 21:07