No, the two snippets do not produce the same splits. scikit-learn's train_test_split shuffles the data with a random permutation and then slices off the requested fraction, while SFrame.random_split makes an independent random draw for each row, so the sizes of its outputs only approximate the specified fraction.
On top of that, the two libraries use different random number generators, so setting the seed and random state to the same value won't make the results agree either.
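To make the difference concrete, here is a minimal numpy sketch of the two mechanisms. This is my own illustration of the idea, not either library's actual implementation:

```python
import numpy as np

data = np.arange(10)

# train_test_split style: shuffle with one permutation, then slice a
# fixed-size chunk. The train set always has exactly round(0.6 * n) rows.
rng = np.random.RandomState(0)
perm = rng.permutation(len(data))
n_train = int(round(0.6 * len(data)))
sk_train, sk_test = data[perm[:n_train]], data[perm[n_train:]]

# random_split style: an independent draw per row. The train set has
# only *approximately* 60% of the rows; its size varies with the seed.
rng = np.random.RandomState(0)
mask = rng.rand(len(data)) < 0.6
gl_train, gl_test = data[mask], data[~mask]

print(len(sk_train))  # always 6 for 10 rows
print(len(gl_train))  # close to 6, but seed-dependent
```

Even with identical random streams, a permutation-then-slice and a per-row coin flip would partition the rows differently.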
I verified this with GraphLab Create 1.7.1 and Scikit-learn 0.17.
import numpy as np
import graphlab as gl
from sklearn.cross_validation import train_test_split  # sklearn 0.17 API

sf = gl.SFrame(np.random.rand(10, 1))
sf = sf.add_row_number('row_id')

sf_train, sf_test = sf.random_split(0.6, seed=0)
df_train, df_test = train_test_split(sf.to_dataframe(),
                                     test_size=0.4,
                                     random_state=0)
The resulting sf_train is:
+--------+-------------------+
| row_id | X1 |
+--------+-------------------+
| 0 | [0.459467634448] |
| 4 | [0.424260273035] |
| 6 | [0.143786736949] |
| 7 | [0.0871068666212] |
| 8 | [0.74631952689] |
| 9 | [0.37570258651] |
+--------+-------------------+
[6 rows x 2 columns]
while df_train looks like:
row_id X1
1 1 [0.561396445174]
6 6 [0.143786736949]
7 7 [0.0871068666212]
3 3 [0.397315891635]
0 0 [0.459467634448]
5 5 [0.033673713722]
Definitely not the same.
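If you need the two splits to agree, one workaround (a sketch under the assumption that both containers can be filtered by row index, not code I have tested against GraphLab) is to draw a single permutation yourself and apply it to both:

```python
import numpy as np

# Hypothetical stand-in for the 10 rows behind the SFrame / DataFrame.
X = np.random.rand(10, 1)

# Draw ONE permutation up front and reuse it everywhere, instead of
# relying on each library's internal RNG.
rng = np.random.RandomState(0)
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:6], perm[6:]

# Any row-indexable structure split with these indices yields the same
# partition: df.iloc[train_idx] for a DataFrame, a row_id filter for an
# SFrame, or plain fancy indexing here.
X_train, X_test = X[train_idx], X[test_idx]
```

Because both libraries would then consume the same precomputed indices, their internal generators no longer matter.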