I am using train_test_split from sklearn.cross_validation to split a source CSV data file into training and test data, using simple Python code like this:
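from sklearn.cross_validation import train_test_split
import pandas as pd
# fPath is the directory prefix for the input and output files
dataset = pd.read_csv(fPath + 'source.csv')
# Hold out a random 20% of the rows as test data
train, test = train_test_split(dataset, test_size=0.2)
# Write both parts without the index column and without a header row
train.to_csv(fPath + "train.csv", index=False, index_label=False, header=False)
test.to_csv(fPath + "test.csv", index=False, index_label=False, header=False)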
The data is split correctly in terms of proportions and randomization. However, when I recombine the newly generated TEST and TRAIN files and compare them side by side with the original source data, the underlying values are slightly different. Not every row and cell is affected, but here and there the differences are small yet significant.
The input and output below are hard to read, but they show an example of the original data (18 lines), followed by the combined TEST and TRAIN output data. I sorted all data by the first column; the row-by-row differences are listed last. These are % numbers, so as you can see, the differences are small and random, but not insignificant. Is this expected?
Original data:
-0.00095 -0.00048 -0.14% -0.00109 -0.00011 -0.00015 0.00016
-0.00055 0.00021 0.06% 0.0006 0.00075 0.00086 0.00076
-0.00044 -0.00034 -0.10% -0.00112 -0.00123 -0.00127 -0.00124
-0.00027 -0.00023 -0.02% -0.00187 -0.0028 -0.00286 -0.00182
-0.00021 -0.00024 0.07% 0.0016 0.00166 0.00022 0.00044
-6.00E-05 -6.00E-05 0.01% 1.00E-05 -4.00E-05 0.00013 0.00099
-5.00E-05 0.00016 0.01% -0.00019 5.00E-05 0.00039 4.00E-05
-2.00E-05 -1.00E-05 0.04% 0.0004 0.00053 0.0009 0.00114
2.00E-05 4.00E-05 -0.05% -0.00205 -0.00285 -0.00151 -0.00206
8.00E-05 -0.00048 0.00% 0.00038 0.00114 0.00111 0.00112
8.00E-05 0.00147 0.04% 0.00037 0.00033 0.00029 0.00021
8.00E-05 4.00E-05 -0.02% -0.00027 -0.00018 -0.00015 -0.00014
8.00E-05 -1.00E-05 -0.02% 0 -3.00E-05 -0.00078 -0.00125
0.00015 -0.0001 -0.07% -0.0004 -0.00114 -0.00099 -0.00071
0.00017 0.00043 0.11% 0.00044 0.00027 -6.00E-05 -4.00E-05
0.00029 0.00019 0.08% 0.00112 0.00167 -0.0019 -0.0014
0.00054 0.00063 0.08% 0.00088 0.00095 0.00097 0.00046
0.00086 -6.00E-05 -0.05% -0.00028 0.00012 -0.0007 -0.00215
0.00115 0.00221 0.03% -0.00033 0.00011 -0.00078 -0.00076
Combined TEST and TRAIN output:
-0.00095 -0.00048 -0.14% -0.00109 -0.00011 -0.00015 0.00016
-0.00055 0.00021 0.06% 0.0006 0.00075 0.00086 0.00076
-0.00044 -0.00034 -0.10% -0.00112 -0.00123 -0.00127 -0.00124
-0.00021 -0.00024 0.07% 0.0016 0.00166 0.00022 0.00044
-6.00E-05 -6.00E-05 0.01% 1.00E-05 -4.00E-05 0.00013 0.00099
-5.00E-05 0.00016 0.01% -0.00019 5.00E-05 0.00039 4.00E-05
-2.00E-05 -1.00E-05 0.04% 0.0004 0.00053 0.0009 0.00114
2.00E-05 4.00E-05 -0.05% -0.00205 -0.00285 -0.00151 -0.00206
8.00E-05 -0.00048 0.00% 0.00038 0.00114 0.00111 0.00112
8.00E-05 4.00E-05 -0.02% -0.00027 -0.00018 -0.00015 -0.00014
8.00E-05 0.00147 0.04% 0.00037 0.00033 0.00029 0.00021
8.00E-05 -1.00E-05 -0.02% 0 -3.00E-05 -0.00078 -0.00125
0.00015 -0.0001 -0.07% -0.0004 -0.00114 -0.00099 -0.00071
0.00017 0.00043 0.11% 0.00044 0.00027 -6.00E-05 -4.00E-05
0.00029 0.00019 0.08% 0.00112 0.00167 -0.0019 -0.0014
0.00054 0.00063 0.08% 0.00088 0.00095 0.00097 0.00046
0.00086 -6.00E-05 -0.05% -0.00028 0.00012 -0.0007 -0.00215
0.00115 0.00221 0.03% -0.00033 0.00011 -0.00078 -0.00076
Differences (original minus combined, row by row):
0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
-0.01% 0.00% -0.08% -0.35% -0.45% -0.31% -0.23%
-0.02% -0.02% 0.06% 0.16% 0.17% 0.01% -0.06%
0.00% -0.02% 0.00% 0.02% -0.01% -0.03% 0.10%
0.00% 0.02% -0.03% -0.06% -0.05% -0.05% -0.11%
0.00% -0.01% 0.09% 0.25% 0.34% 0.24% 0.32%
-0.01% 0.05% -0.05% -0.24% -0.40% -0.26% -0.32%
0.00% -0.05% 0.02% 0.07% 0.13% 0.13% 0.13%
0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
0.00% 0.01% 0.00% -0.03% -0.02% 0.06% 0.11%
-0.01% 0.01% 0.05% 0.04% 0.11% 0.02% -0.05%
0.00% -0.05% -0.18% -0.08% -0.14% -0.09% -0.07%
-0.01% 0.02% 0.03% -0.07% -0.14% 0.18% 0.14%
-0.03% -0.04% -0.01% 0.02% 0.07% -0.29% -0.19%
-0.03% 0.07% 0.13% 0.12% 0.08% 0.17% 0.26%
-0.03% -0.23% -0.07% 0.01% 0.00% 0.01% -0.14%
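For reference, the same comparison can be done without lining the rows up side by side. The following is only a minimal sketch, not the exact procedure I used: it relies on the Python standard library alone, and the helper names `parse_cell`, `read_rows` and `compare_split`, as well as the small inline sample data, are hypothetical. It treats the rows as multisets, so neither the sort order nor a difference in row count can shift the comparison.

```python
import csv
import io
from collections import Counter


def parse_cell(text):
    """Convert a cell to float when possible, so 6e-05 and 6.00E-05 compare equal."""
    try:
        return float(text)
    except ValueError:
        return text.strip()  # e.g. "-0.14%" stays a string


def read_rows(handle, skip_header=False):
    """Read CSV rows as tuples; optionally drop the first (header) row."""
    rows = [tuple(parse_cell(c) for c in row) for row in csv.reader(handle) if row]
    return rows[1:] if skip_header else rows


def compare_split(source_rows, train_rows, test_rows):
    """Compare rows as multisets, independent of row order and alignment."""
    source = Counter(source_rows)
    combined = Counter(train_rows) + Counter(test_rows)
    return {
        "source_count": len(source_rows),
        "combined_count": len(train_rows) + len(test_rows),
        "missing_from_split": list((source - combined).elements()),
        "extra_in_split": list((combined - source).elements()),
    }


# Made-up stand-ins for source.csv (with a header), train.csv and test.csv.
source_csv = "a,b,c\n-0.00095,-0.00048,-0.14%\n-0.00055,0.00021,0.06%\n-6.00E-05,-6.00E-05,0.01%\n"
train_csv = "-6e-05,-6e-05,0.01%\n-0.00095,-0.00048,-0.14%\n"
test_csv = "-0.00055,0.00021,0.06%\n"

report = compare_split(
    read_rows(io.StringIO(source_csv), skip_header=True),
    read_rows(io.StringIO(train_csv)),
    read_rows(io.StringIO(test_csv)),
)
print(report)
```

With the real files, the three `io.StringIO(...)` arguments would be replaced by open file handles for source.csv, train.csv and test.csv. If `source_count` and `combined_count` differ, a positional side-by-side comparison will be misaligned from the first missing row onward.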