
I am using train_test_split from sklearn.cross_validation to split a source CSV data file into training and test data, using simple Python code like this:

from sklearn.cross_validation import train_test_split
import pandas as pd

dataset = pd.read_csv(fPath + 'source.csv')

train, test = train_test_split(dataset, test_size = 0.2)
train.to_csv(fPath + "train.csv", index=False, index_label=False, header=False)
test.to_csv(fPath + "test.csv", index=False, index_label=False, header=False)

The data is split correctly in terms of proportions and randomization, but I noticed that the underlying data in the newly generated TRAIN and TEST files is slightly different from the original source data when recombined and compared side by side. Not every row and cell differs, but here and there the values show small yet significant differences.

The input and output below are hard to read, but they show an example of the original data (18 lines) and then the combined TEST and TRAIN output. I sorted both sets by the first column, and the cell-by-cell differences are shown last. These are percentage figures, so as you can see, the differences are small and random, but not insignificant. Is this expected?
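For reference, the same recombine-and-compare check could also be scripted in pandas rather than done in Excel. The snippet below is only a rough sketch of that verification; it assumes the files were written by the split code above (with the same fPath prefix) and that every column is numeric:

import pandas as pd

# Illustrative only: reload the original file and the two split files,
# recombine them, and compare row counts and sorted values.
# header=None is used for train.csv / test.csv because they were written
# without a header row.
original = pd.read_csv(fPath + 'source.csv')
train = pd.read_csv(fPath + 'train.csv', header=None)
test = pd.read_csv(fPath + 'test.csv', header=None)

recombined = pd.concat([train, test], ignore_index=True)

# The row counts should already match.
print(len(original), len(recombined))

if len(original) == len(recombined):
    # Sort both frames by all of their columns so the random split order
    # no longer matters, then compare the raw values cell by cell.
    orig_sorted = original.sort_values(list(original.columns)).reset_index(drop=True)
    recomb_sorted = recombined.sort_values(list(recombined.columns)).reset_index(drop=True)
    print((orig_sorted.values == recomb_sorted.values).all())

If the two counts at the first print already disagree, rows were gained or lost at read/write time rather than inside train_test_split.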

Original data:

-0.00095    -0.00048    -0.14%  -0.00109    -0.00011    -0.00015    0.00016
-0.00055    0.00021 0.06%   0.0006  0.00075 0.00086 0.00076
-0.00044    -0.00034    -0.10%  -0.00112    -0.00123    -0.00127    -0.00124
-0.00027    -0.00023    -0.02%  -0.00187    -0.0028 -0.00286    -0.00182
-0.00021    -0.00024    0.07%   0.0016  0.00166 0.00022 0.00044
-6.00E-05   -6.00E-05   0.01%   1.00E-05    -4.00E-05   0.00013 0.00099
-5.00E-05   0.00016 0.01%   -0.00019    5.00E-05    0.00039 4.00E-05
-2.00E-05   -1.00E-05   0.04%   0.0004  0.00053 0.0009  0.00114
2.00E-05    4.00E-05    -0.05%  -0.00205    -0.00285    -0.00151    -0.00206
8.00E-05    -0.00048    0.00%   0.00038 0.00114 0.00111 0.00112
8.00E-05    0.00147 0.04%   0.00037 0.00033 0.00029 0.00021
8.00E-05    4.00E-05    -0.02%  -0.00027    -0.00018    -0.00015    -0.00014
8.00E-05    -1.00E-05   -0.02%  0   -3.00E-05   -0.00078    -0.00125
0.00015 -0.0001 -0.07%  -0.0004 -0.00114    -0.00099    -0.00071
0.00017 0.00043 0.11%   0.00044 0.00027 -6.00E-05   -4.00E-05
0.00029 0.00019 0.08%   0.00112 0.00167 -0.0019 -0.0014
0.00054 0.00063 0.08%   0.00088 0.00095 0.00097 0.00046
0.00086 -6.00E-05   -0.05%  -0.00028    0.00012 -0.0007 -0.00215
0.00115 0.00221 0.03%   -0.00033    0.00011 -0.00078    -0.00076

Recombined TRAIN + TEST data:

-0.00095    -0.00048    -0.14%  -0.00109    -0.00011    -0.00015    0.00016
-0.00055    0.00021 0.06%   0.0006  0.00075 0.00086 0.00076
-0.00044    -0.00034    -0.10%  -0.00112    -0.00123    -0.00127    -0.00124
-0.00021    -0.00024    0.07%   0.0016  0.00166 0.00022 0.00044
-6.00E-05   -6.00E-05   0.01%   1.00E-05    -4.00E-05   0.00013 0.00099
-5.00E-05   0.00016 0.01%   -0.00019    5.00E-05    0.00039 4.00E-05
-2.00E-05   -1.00E-05   0.04%   0.0004  0.00053 0.0009  0.00114
2.00E-05    4.00E-05    -0.05%  -0.00205    -0.00285    -0.00151    -0.00206
8.00E-05    -0.00048    0.00%   0.00038 0.00114 0.00111 0.00112
8.00E-05    4.00E-05    -0.02%  -0.00027    -0.00018    -0.00015    -0.00014
8.00E-05    0.00147 0.04%   0.00037 0.00033 0.00029 0.00021
8.00E-05    -1.00E-05   -0.02%  0   -3.00E-05   -0.00078    -0.00125
0.00015 -0.0001 -0.07%  -0.0004 -0.00114    -0.00099    -0.00071
0.00017 0.00043 0.11%   0.00044 0.00027 -6.00E-05   -4.00E-05
0.00029 0.00019 0.08%   0.00112 0.00167 -0.0019 -0.0014
0.00054 0.00063 0.08%   0.00088 0.00095 0.00097 0.00046
0.00086 -6.00E-05   -0.05%  -0.00028    0.00012 -0.0007 -0.00215
0.00115 0.00221 0.03%   -0.00033    0.00011 -0.00078    -0.00076

Differences:

0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
-0.01%  0.00%   -0.08%  -0.35%  -0.45%  -0.31%  -0.23%
-0.02%  -0.02%  0.06%   0.16%   0.17%   0.01%   -0.06%
0.00%   -0.02%  0.00%   0.02%   -0.01%  -0.03%  0.10%
0.00%   0.02%   -0.03%  -0.06%  -0.05%  -0.05%  -0.11%
0.00%   -0.01%  0.09%   0.25%   0.34%   0.24%   0.32%
-0.01%  0.05%   -0.05%  -0.24%  -0.40%  -0.26%  -0.32%
0.00%   -0.05%  0.02%   0.07%   0.13%   0.13%   0.13%
0.00%   0.00%   0.00%   0.00%   0.00%   0.00%   0.00%
0.00%   0.01%   0.00%   -0.03%  -0.02%  0.06%   0.11%
-0.01%  0.01%   0.05%   0.04%   0.11%   0.02%   -0.05%
0.00%   -0.05%  -0.18%  -0.08%  -0.14%  -0.09%  -0.07%
-0.01%  0.02%   0.03%   -0.07%  -0.14%  0.18%   0.14%
-0.03%  -0.04%  -0.01%  0.02%   0.07%   -0.29%  -0.19%
-0.03%  0.07%   0.13%   0.12%   0.08%   0.17%   0.26%
-0.03%  -0.23%  -0.07%  0.01%   0.00%   0.01%   -0.14%
– VS_FF
  • I did not understand. Please elaborate. – Vivek Kumar Feb 07 '17 at 10:14
  • You split the source file into two files using `train_test_split`. Then, just to double-check, you manually recombine the newly created files and compare their data to the data in the original source file, and it is not exactly the same. I sorted both the original data and the recombined data by the first column (an arbitrary choice) just to bring everything into the same order, since the split happens randomly, and then did a simple cellXX-cellYY subtraction in Excel to see how similar or different the data is, and it's not the same. – VS_FF Feb 07 '17 at 10:28
  • You have shown 3 chunks of data here. The first has 19 lines; the second and third have 18 lines. I wanted to know what that is. – Vivek Kumar Feb 07 '17 at 10:35
  • Also, leaving aside the one extra line in the first chunk, the first and second chunks are exactly the same (two rows are interchanged, but that's because of your sorting; their first-column values are equal). What differences are you getting? – Vivek Kumar Feb 07 '17 at 10:37
  • You are absolutely right. It looks like one row was lost during the split for some reason, and that's what broke the sort on a column with non-zero values and therefore the comparison of the first XY cells. I'm not sure why the split loses one row -- either because the source file has an odd number of rows and the parameter was 0.2, or maybe because it interprets the first row as headers by default? I'll look into this... – VS_FF Feb 07 '17 at 10:53
  • No, it could not be lost in `train_test_split`. With 19 rows and test_size=0.2, it will generate 15 train and 4 test rows. – Vivek Kumar Feb 07 '17 at 11:05
  • Yes, pd.read_csv() considers the first row as a header. Pass `header=None` in the read_csv() command. – Vivek Kumar Feb 07 '17 at 11:09
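Based on that comment thread, the discrepancy comes from pd.read_csv() consuming the first data row as column names, not from train_test_split itself. A minimal sketch of the corrected split, assuming source.csv really has no header row:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
import pandas as pd

# header=None stops read_csv from treating the first data row as column names
dataset = pd.read_csv(fPath + 'source.csv', header=None)

train, test = train_test_split(dataset, test_size=0.2)
print(len(dataset), len(train), len(test))  # with 19 rows and test_size=0.2: 15 train, 4 test

train.to_csv(fPath + "train.csv", index=False, header=False)
test.to_csv(fPath + "test.csv", index=False, header=False)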
