
I'm going through a Kaggle tutorial at the moment. I get the basic idea of what this code does from looking at the output and reading the documentation, but I'd like confirmation of what is going on here:

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []

for train, test in kf:
    train_predictors = (titanic[predictors].iloc[train,:])

My main issue is the last line, with the iloc indexer; the rest is just for context. Does it just split the training data up?

cchamberlain
PurpleCoffee
  • You could've looked at the [docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position) and printed out `train` no? – EdChum Dec 10 '15 at 11:37
  • @EdChum yeah, I am looking at the docs, and honestly I printed out train_predictors to see how that was changed. If I print train on its own before and after that line, it's the same, since that wasn't changed outside the train_predictors variable, no? – PurpleCoffee Dec 10 '15 at 11:44
  • your indentation is off, shouldn't `train_predictors = (titanic[predictors].iloc[train,:])` be indented? – EdChum Dec 10 '15 at 11:45
  • Yeah, it is in my code. I forgot to do it when putting it on here. Wasn't a straight copy paste since I have a lot of commenting. – PurpleCoffee Dec 10 '15 at 11:46

1 Answer


.iloc[] is pandas' primary indexer for selecting rows and columns of a DataFrame by integer position (for a Series, which has a single axis, it takes positions along the index only). It is well explained in the Indexing docs.

In this specific case, from the scikit-learn docs:

KFold divides all the samples in k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for test. Example of 2-fold cross-validation on a dataset with 4 samples:

import numpy as np
from sklearn.cross_validation import KFold

kf = KFold(4, n_folds=2)
for train, test in kf:
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]

In other words, KFold picks the index positions. The for loop over kf yields them as train and test, and train is passed to .iloc so that it selects the rows at those positions (and all columns) from the titanic[predictors] DataFrame containing the training set.
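
To make the "positions, not labels" point concrete, here is a minimal sketch. The toy DataFrame, its column values and its index labels are made up for illustration (they are not the Kaggle data), and the train/test arrays are written by hand to match the first fold printed above rather than produced by KFold:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the titanic DataFrame.
# The index labels are deliberately not 0..n-1.
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3, 1], "Fare": [7.25, 71.28, 7.92, 53.10]},
    index=[10, 20, 30, 40],
)
predictors = ["Pclass", "Fare"]

# Integer positions of the kind KFold yields for one fold
# (hand-written here, copied from the first fold above).
train = np.array([2, 3])
test = np.array([0, 1])

# .iloc picks the rows at those positions and keeps all columns.
train_predictors = toy[predictors].iloc[train, :]
test_predictors = toy[predictors].iloc[test, :]

print(train_predictors.index.tolist())  # [30, 40]
print(test_predictors.index.tolist())   # [10, 20]
```

Positions 2 and 3 map to the rows labelled 30 and 40, which is why .iloc (rather than label-based .loc) is the right tool for the arrays KFold produces. Note that in scikit-learn 0.20 and later the sklearn.cross_validation module no longer exists; KFold lives in sklearn.model_selection, takes n_splits, and is iterated with kf.split(...).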

Stefan
  • are these the same `newDf = df.iloc[:-1]` and `newDf = df[:-1]`? – whytheq Apr 03 '16 at 18:01
  • iloc is specifically for integer indexing as opposed to indexing by label. – Stefan Apr 03 '16 at 19:18
  • Did this answer your question after all? – Stefan May 15 '16 at 21:06
  • yes that helps me a lot - do you know of any good pandas tutorials - specifically the basics around the dataframe object? (I've upped your answer - although not for me to mark it as correct though as I'm not the original questioner) – whytheq May 15 '16 at 21:32
  • Hi, thanks. It really pays to carefully read through the pandas docs: http://pandas.pydata.org/pandas-docs/stable/. There's also Wes McKinney's (original pandas author) book - here's the github repo with related notebooks: https://github.com/wesm/pydata-book. – Stefan May 15 '16 at 21:38
  • oh yeah - very nice: I'll export some of those notebooks and play - cheers. I'm still struggling with the basics of DataFrames – whytheq May 15 '16 at 21:53
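
On the comment question about `df.iloc[:-1]` versus `df[:-1]`: a small sketch, using made-up frames (`df` and `df2` are hypothetical), suggests that for an integer slice like this the two select the same rows, while the positional-versus-label distinction shows up once you ask for a single integer key on an integer-labelled index:

```python
import pandas as pd

# Hypothetical frame with string labels.
df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# Both forms drop the last row by position.
by_iloc = df.iloc[:-1]
by_brackets = df[:-1]
print(by_iloc.equals(by_brackets))  # True

# Hypothetical frame whose integer labels run opposite to the positions.
df2 = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 1, 0])

first_by_position = df2.iloc[0, 0]  # row at position 0 -> 1
first_by_label = df2.loc[0, "a"]    # row labelled 0   -> 3
print(first_by_position, first_by_label)
```

Using .iloc spells out that positions are meant, so the code does not depend on how plain brackets interpret a given key.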