11

What is the difference between StratifiedKFold, StratifiedShuffleSplit, and StratifiedKFold + shuffle? When should I use each one? With which one do I get a better accuracy score? Why do I not get similar results? I have included my code and the results. I am using Naive Bayes and 10x10 cross-validation.

    #######SKF FOR LOOP########
    from sklearn import cross_validation
    from sklearn.cross_validation import StratifiedKFold

    # clf, x and y are the Naive Bayes classifier and the data, defined elsewhere
    for i in range(10):
        skf = StratifiedKFold(y, n_folds=10, shuffle=True)
        scoresSKF2 = cross_validation.cross_val_score(clf, x, y, cv=skf)
        print(scoresSKF2)
        print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSKF2.mean(), scoresSKF2.std() * 2))
        print("")

    [ 0.1750503   0.16834532  0.16417051  0.18205424  0.1625758   0.1750939
      0.15495808  0.1712963   0.17096494  0.16918166]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16297787  0.17956835  0.17309908  0.17686093  0.17239388  0.16093615
      0.16970223  0.16956019  0.15473776  0.17208358]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.17102616  0.16719424  0.1733871   0.16560877  0.166041    0.16122508
      0.16767852  0.17042824  0.18719212  0.1677307 ]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.17275079  0.16633094  0.16906682  0.17570687  0.17210511  0.15515747
      0.16594391  0.18113426  0.16285135  0.1746953 ]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.1764875   0.17035971  0.16186636  0.1644547   0.16632977  0.16469229
      0.17635155  0.17158565  0.17849899  0.17005223]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16815177  0.16863309  0.17309908  0.17368725  0.17152758  0.16093615
      0.17143683  0.17158565  0.16574906  0.16511898]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16786433  0.16690647  0.17309908  0.17022504  0.17066128  0.16613695
      0.17259324  0.17737269  0.16256158  0.17643645]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16297787  0.16402878  0.17684332  0.16791691  0.16950621  0.1716267
      0.18328997  0.16984954  0.15792524  0.17701683]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.16958896  0.16633094  0.17165899  0.17080208  0.16026567  0.17538284
      0.17490604  0.16840278  0.17502173  0.16511898]
    Accuracy SKF_NB: 0.17 (+/- 0.01)

    [ 0.17275079  0.15625899  0.17713134  0.16762839  0.18278949  0.16729269
      0.16449841  0.17303241  0.16111272  0.1610563 ]
    Accuracy SKF_NB: 0.17 (+/- 0.02)


    #####StratifiedKFold + Shuffle######
    from sklearn.utils import shuffle

    for i in range(10):
        # bug: y is overwritten but x is not, so from the second iteration
        # onward x and the reshuffled y are no longer aligned row-for-row
        X, y = shuffle(x, y, random_state=i)
        skf = StratifiedKFold(y, 10)
        scoresSKF2 = cross_validation.cross_val_score(clf, X, y, cv=skf)
        print(scoresSKF2)
        print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSKF2.mean(), scoresSKF2.std() * 2))
        print("")

    [ 0.16700201  0.15913669  0.16359447  0.17772649  0.17297141  0.16931523
      0.17172593  0.18576389  0.17125471  0.16134649]
    Accuracy SKF_NB: 0.17 (+/- 0.02)

    [ 0.02874389  0.02705036  0.02592166  0.02740912  0.02714409  0.02687085
      0.02891009  0.02922454  0.0260794   0.02814858]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.0221328   0.02848921  0.02361751  0.02942874  0.02598903  0.02947125
      0.02804279  0.02719907  0.02376123  0.02205456]
    Accuracy SKF_NB: 0.03 (+/- 0.01)

    [ 0.02788158  0.02848921  0.03081797  0.03289094  0.02829916  0.03293846
      0.02862099  0.02633102  0.03245436  0.02843877]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02874389  0.0247482   0.02448157  0.02625505  0.02483396  0.02860445
      0.02948829  0.02604167  0.02665894  0.0275682 ]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.0221328   0.02705036  0.02476959  0.02510098  0.02454519  0.02687085
      0.02254987  0.02199074  0.02492031  0.02524666]
    Accuracy SKF_NB: 0.02 (+/- 0.00)

    [ 0.02615694  0.03079137  0.02102535  0.03029429  0.02252382  0.02889338
      0.02197167  0.02604167  0.02752825  0.02843877]
    Accuracy SKF_NB: 0.03 (+/- 0.01)

    [ 0.02673182  0.02676259  0.03197005  0.03115984  0.02512273  0.03236059
      0.02688638  0.02372685  0.03216459  0.02698781]
    Accuracy SKF_NB: 0.03 (+/- 0.01)

    [ 0.0258695   0.02964029  0.03081797  0.02740912  0.02916546  0.02976018
      0.02717548  0.02922454  0.02694871  0.0275682 ]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.03506755  0.0247482   0.02592166  0.02740912  0.02772163  0.02773765
      0.02948829  0.0234375   0.03332367  0.02118398]
    Accuracy SKF_NB: 0.03 (+/- 0.01)


    ######StratifiedShuffleSplit##########
    from sklearn.cross_validation import StratifiedShuffleSplit

    for i in range(10):
        # random_state is fixed at 0, so every iteration produces identical splits
        sss = StratifiedShuffleSplit(y, 10, test_size=0.1, random_state=0)
        scoresSSS = cross_validation.cross_val_score(clf, x, y, cv=sss)
        print(scoresSSS)
        print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSSS.mean(), scoresSSS.std() * 2))
        print("")
    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)

    [ 0.02743286  0.02858793  0.02512273  0.02281259  0.02541149  0.02743286
      0.02570026  0.02454519  0.02570026  0.02858793]
    Accuracy SKF_NB: 0.03 (+/- 0.00)
Aizzaac
    Can you show us your output? Are you running this one time and comparing accuracy scores or are you running it many times and comparing average accuracy scores for each? – user6275647 Jun 05 '16 at 10:09
  • Okay, I have added the output. I am running a 10x10 cross-validation and comparing average accuracy scores. – Aizzaac Jun 05 '16 at 12:37

2 Answers

11

It's hard to say which one is better. The choice depends on the strategy and goals of your modeling; however, there is a strong preference in the community for using k-fold cross-validation for both model selection and performance estimation. I will try to give you some intuition for the two main concepts that guide the choice of sampling technique: stratification, and cross-validation vs. random splitting.

Also keep in mind that you can use these sampling techniques for two very different goals: model selection and performance estimation.

Stratification works by preserving the balance, or ratio, between the labels/targets in the dataset. So if your whole dataset has two labels (e.g. positive and negative) with a 30/70 ratio and you split it into 10 subsamples, each stratified subsample should keep the same ratio. Reasoning: because model performance is generally very sensitive to class balance, this strategy tends to make models more stable across subsamples.
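To make this concrete, here is a small sketch using the current `sklearn.model_selection` API (the question uses the older, deprecated `sklearn.cross_validation` module); the 30/70 toy dataset is invented for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy dataset with a 30/70 class ratio: 30 positives, 70 negatives.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every 10-sample test fold keeps the 30/70 ratio: 7 negatives, 3 positives.
    print(np.bincount(y[test_idx]))  # -> [7 3] for every fold
```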

Split vs. random split. A split is just a split, usually for the purpose of having separate training and testing subsamples. But taking the first X% of the data for one subsample and the remainder for another might not be a good idea, because it can introduce very high bias (for example, when the data are sorted by class or by time). That is where random splitting comes in: it introduces randomness into the subsampling.
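For instance, with data sorted by class, a plain head/tail split is badly biased while a random split is not. A sketch with the current API (the sorted toy data is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Data sorted by class: all class-0 rows first, then all class-1 rows.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# Naive split: first 80% train, last 20% test -> the test set is all class 1.
X_tr, X_te, y_tr, y_te = X[:80], X[80:], y[:80], y[80:]
print(np.bincount(y_te, minlength=2))  # one class is completely missing

# Random (shuffled) split avoids that ordering bias.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.bincount(y_te, minlength=2))  # both classes represented
```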

K-fold cross-validation vs. random split. K-fold consists of creating k subsamples (folds). Because you now have more than two subsamples, you can hold out one fold for testing and train on the remaining folds, repeat this for every possible choice of test fold, and average the results. This is known as cross-validation. Doing k-fold cross-validation is like doing a (non-random) split k times and then averaging. A small sample might not benefit from k-fold cross-validation, while a large sample almost always does. A random split is a more efficient (faster) way of estimating, but it may be more prone to sampling bias than k-fold cross-validation. Combining stratification with random splitting is an attempt to get a sampling strategy that is both effective and efficient while preserving the label distribution.
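A sketch contrasting the two strategies with the current API; `GaussianNB` stands in for the question's Naive Bayes classifier, and the imbalanced toy data is generated purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, StratifiedShuffleSplit,
                                     cross_val_score)
from sklearn.naive_bayes import GaussianNB

# Imbalanced (~70/30) two-class toy problem.
X, y = make_classification(n_samples=500, n_classes=2, weights=[0.7, 0.3],
                           random_state=0)
clf = GaussianNB()

# K-fold: 10 disjoint stratified folds; each sample is tested exactly once.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=skf).mean())

# Random split: 10 independent stratified 90/10 resamples; test sets may overlap.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
print(cross_val_score(clf, X, y, cv=sss).mean())
```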

Rabbit
  • Thank you for your answer. I will be using Stratified cross validation + random split or whichever gives me the best accuracy. – Aizzaac Jun 08 '16 at 22:29
  • 4
    @Rabbit this answer doesn't address StratifiedShuffleSplit, which appears to do cross-validation that (unlike k-fold) resamples for each CV batch. It's unclear to me why this would be better or worse than k-fold, which I think was the original question? – James Sep 17 '18 at 01:27
-1
  1. StratifiedKFold: Here I shuffle both arrays BUT keep each row with its label.
  2. StratifiedKFold + Shuffle: Here I shuffle both arrays before the cross-validation, so each row is no longer linked to its label. That is why the accuracy is so bad compared to 1.
  3. StratifiedShuffleSplit: Here the accuracy is still bad and the same as in 2, because the arrays were already shuffled in 2 and therefore there is no longer a link between the rows and their labels. But when I ran it standalone, the accuracy was as good as in 1. So basically 1 and 3 do the same thing.
Aizzaac
  • 2
    StratifiedKFold + Shuffle does NOT de-link your features from their labels. Any de-linking from the labels doesn't make sense in terms of supervised learning. – Heavy Breathing Apr 25 '18 at 20:16