1

Is there a Python (perhaps pandas) equivalent to R's

install.packages("caTools")
library(caTools)
set.seed(88)
split = sample.split(df$col, SplitRatio = 0.75)

that will generate exactly the same value split?


My current context, as an example, is getting pandas DataFrames that correspond exactly to the R data frames (qualityTrain, qualityTest) created by:

# https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv
quality = read.csv("quality.csv")
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
Cath
orome

4 Answers

2

I think scikit-learn's train_test_split function might work for you.

import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before scikit-learn 0.18

url = 'https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv'
quality = pd.read_csv(url)

train, test = train_test_split(quality, train_size=0.75, random_state=88)

qualityTrain = pd.DataFrame(train, columns=quality.columns)
qualityTest = pd.DataFrame(test, columns=quality.columns)

Unfortunately I don't get the same rows as the R function. I'm guessing it's the seeding, but could be wrong.

Greg
  • That doesn't work for me: I get `np.sum(test_mask) + np.sum(train_mask)` that's not the same as `len(quality)`. – orome Mar 19 '14 at 17:19
  • @raxacoricofallapatorius - I just realized my (big) error. It should be correct now. – Greg Mar 19 '14 at 17:27
  • Thanks. Any idea why I get `ValueError: operands could not be broadcast together with shapes (98,2) (98)` when I `sm.Logit.from_formula('PoorCare ~ OfficeVisits + Narcotics', qualityTrain2).fit()`? – orome Mar 19 '14 at 17:33
  • And you're right: the rows selected are completely different from the ones R selects. Any thoughts on how to get the same rows (other than generating the mask in R and transferring it over)? – orome Mar 19 '14 at 17:34
  • 1
    @raxacoricofallapatorius - Let me get back to you on the statsmodels stuff. Regarding the row selection, I'm not sure of another way. It looks like seed numbers don't generate the same results across programming languages ([see this](http://stackoverflow.com/questions/4045579/random-numbers-across-different-programming-languages)) – Greg Mar 19 '14 at 17:51
  • @raxacoricofallapatorius - I believe you are having the statsmodels issue because the `PoorCare` variable is of type Object. I think `qualityTrain2.PoorCare = qualityTrain2.PoorCare.astype(int)` should make it work (at least it did for me). Let me know if you are still having issues. – Greg Mar 20 '14 at 14:44
1

Splitting with sample.split from the caTools library preserves the class distribution. scikit-learn's train_test_split does not guarantee that by default (it splits the dataset into random train and test subsets).

You can get a result equivalent to R's caTools (with respect to class distribution) by using StratifiedShuffleSplit instead. It lived in sklearn.cross_validation when this was written and is in sklearn.model_selection from scikit-learn 0.18 on:

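from sklearn.model_selection import StratifiedShuffleSplit

# scikit-learn >= 0.18 API: the labels are passed to split(), not to the constructor
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(quality, quality['PoorCare']):
    qualityTrain = quality.iloc[train_index, :]
    qualityTest = quality.iloc[test_index, :]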
noleto
  • You can now pass an additional `stratify` argument to train_test_split: the array of class labels used to stratify the split. See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – noleto Jun 29 '18 at 10:54
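
As a rough illustration of the `stratify` option mentioned in the comment above, here is a minimal sketch. The `demo` frame is synthetic and purely hypothetical (it is not quality.csv), and the seed only makes the Python split repeatable; it does not reproduce the rows R picks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for quality.csv: 120 rows, 40 of them with PoorCare == 1.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "OfficeVisits": rng.randint(0, 40, size=120),
    "PoorCare": np.repeat([0, 1], [80, 40]),
})

# stratify= keeps the PoorCare proportion the same in both subsets.
train, test = train_test_split(
    demo, train_size=0.75, random_state=88, stratify=demo["PoorCare"]
)

# Share of PoorCare == 1 in each subset.
print(train["PoorCare"].mean(), test["PoorCare"].mean())
```

This gives the same kind of split as sample.split (class balance preserved), not the same rows.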
0

I know this is an old thread, but I found it while looking for a solution to the same problem. Many online statistics and machine learning classes are taught in R. If you want to follow along in Python, you run into this issue: the class tells you to call set.seed() in R and then use something like caTools' sample.split, and unless you get exactly the same split, your later results differ and you can't get the right answer to a quiz or exercise question.

One of the main issues is that although both Python and R use the Mersenne Twister algorithm by default for pseudo-random number generation, they don't produce the same results given the same seed; I found this by inspecting the random states of their respective PRNGs. One of them (I forget which) uses signed numbers and the other unsigned, so there seems to be little hope of finding a Python seed that reproduces the same series of numbers as R.
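
Since matching the seeds looks impractical, the workaround raised in the comments on the first answer (generate the mask in R and transfer it over) may be the simplest route. Below is a minimal sketch of the Python side. The file layout (a single `split` column of TRUE/FALSE), the helper name `split_with_r_mask`, and the toy data are all assumptions made for illustration; an in-memory string stands in for the exported file.

```python
import io
import pandas as pd

def split_with_r_mask(df, mask):
    """Split df into (train, test) using a boolean mask exported from R."""
    mask_values = pd.Series(mask).to_numpy(dtype=bool)
    if len(mask_values) != len(df):
        raise ValueError("mask length must match the number of rows")
    return df[mask_values], df[~mask_values]

# Stand-in for a file written in R with something like:
#   write.csv(data.frame(split = split), "split.csv", row.names = FALSE)
r_export = io.StringIO("split\nTRUE\nFALSE\nTRUE\nTRUE\n")

# Read as text and compare explicitly, so the result does not depend on
# how the CSV parser interprets TRUE/FALSE.
mask = pd.read_csv(r_export, dtype=str)["split"].str.upper().eq("TRUE")

# Hypothetical toy frame; in practice this would be the quality DataFrame.
toy = pd.DataFrame({"MemberID": [1, 2, 3, 4], "PoorCare": [0, 1, 0, 0]})
train, test = split_with_r_mask(toy, mask)
```

This relies on the rows being in the same order in R and in pandas, which should hold if both read the same CSV without reordering.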

yatinla
0

A small correction to the above: StratifiedShuffleSplit is now part of sklearn.model_selection.

I have some data with X and Y in separate NumPy arrays. The proportion of 1s in my Y array is about 4.1%. If I use StratifiedShuffleSplit, it maintains this proportion in the train and test sets it produces. See below.

# sss is a StratifiedShuffleSplit instance created beforehand (its parameters are not shown here)
full_data_Y_np.sum() / len(full_data_Y_np)
# 0.041006701187937859
for train_index, test_index in sss.split(full_data_X_np, full_data_Y_np):
    X_train = full_data_X_np[train_index]
    Y_train = full_data_Y_np[train_index]
    X_test = full_data_X_np[test_index]
    Y_test = full_data_Y_np[test_index]
Y_train.sum() / len(Y_train)
# 0.041013925152306355
Y_test.sum() / len(Y_test)
# 0.040989847715736043
Anugraha Sinha