I'm facing an issue where sklearn's train_test_split() splits the data unexpectedly on large data sets. When I load the entire 118 MB data set, the test set it assigns is less than a tenth of what I expect from the code.
Case 1: 60K data points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv',nrows=60000)
data.shape
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output: (40200, 8) (40200,) (19800, 8) (19800,)
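This matches the arithmetic I expect. A quick check of just the split arithmetic (no data involved, only the row count and test_size used above):
# roughly what I expect from a 0.33 test split of 60,000 rows
n_rows = 60000
expected_test = round(n_rows * 0.33)      # ~19800
expected_train = n_rows - expected_test   # ~40200
print(expected_train, expected_test)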
Case 2: 109,000 data points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=123)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output: (109248, 9) (90552, 8) (90552,) (1460, 8) (1460,)
Anything more than 60K data points gets split abruptly, as in case 2, into roughly 90K and 1.4K rows. Notably, 90,552 + 1,460 = 92,012, which doesn't even add up to the 109,248 rows that were read. I've tried changing the random state, removing it, and moving the data set to a new location, but the issue stays the same.
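For what it's worth, this is the diagnostic I'm planning to run next, to rule out the CSV being parsed into fewer or garbled rows than I think. It only assumes the same file and label column used above:
#diagnostic: does the parsed data add up, and are the labels clean?
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)                                              # rows/columns as pandas parsed them
print(data['project_is_approved'].value_counts(dropna=False))  # any unexpected label values?
print(data.isnull().sum())                                     # missing fields hinting at a bad parse
# after splitting, train + test should cover every row exactly once:
# len(X_train) + len(X_test) == len(X)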