I've used train_test_split() numerous times with index slicing, but for some reason it's retaining the predictor values in both the y train and y test sets. Below is example data, along with the train/test slicing and the resulting shapes.
Original data example:
nypd_dummy.head(3)
        borough     status
start
2016      BRONX  ATTEMPTED
2017   BROOKLYN  ATTEMPTED
2018      BRONX  COMPLETED
Dummy-encoded data:
nypd_dummies = pd.get_dummies(nypd_dummy)
nypd_dummies.head(3)
       borough_BRONX  borough_BROOKLYN  status_ATTEMPTED  status_COMPLETED
start
2016               1                 0                 1                 0
2017               0                 1                 1                 0
2018               1                 0                 0                 1
X_dummies = nypd_dummies.iloc[:, 2:]
y_dummies = nypd_dummies.iloc[:, :2]
xtrain_dummy, xtest_dummy, ytrain_dummy, ytest_dummy = train_test_split(X_dummies, y_dummies, test_size=0.3)
print 'x train:', xtrain_dummy.shape, 'x test:', xtest_dummy.shape
print 'y train:', ytrain_dummy.shape, 'y test:', ytest_dummy.shape
x train: (3, 2) x test: (1, 2)
y train: (3, 2) y test: (1, 2)
Ultimately, I'm aiming to create a model that predicts the borough. Is it not slicing correctly because I'm pulling predictor values from multiple columns as opposed to one single output?
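For comparison, here is a minimal sketch of the single-output alternative I'm asking about. It assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection, and Python 3 print syntax. The four-row frame is a hypothetical stand-in for my data, including an invented fourth row. Only the status column is dummy-encoded, and borough stays as one label column, so y should come out one-dimensional instead of keeping two dummy columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for nypd_dummy (the 2019 row is invented for illustration).
nypd_dummy = pd.DataFrame(
    {
        "borough": ["BRONX", "BROOKLYN", "BRONX", "BROOKLYN"],
        "status": ["ATTEMPTED", "ATTEMPTED", "COMPLETED", "COMPLETED"],
    },
    index=pd.Index([2016, 2017, 2018, 2019], name="start"),
)

# Dummy-encode only the feature column; leave the target as a single label column.
X = pd.get_dummies(nypd_dummy[["status"]])
y = nypd_dummy["borough"]

xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# X keeps two dummy columns; y is a 1-D Series, so its shape has a single entry.
print("x train:", xtrain.shape, "x test:", xtest.shape)
print("y train:", ytrain.shape, "y test:", ytest.shape)
```

With iloc[:, :2] as in my code above, y is a two-column frame of borough dummies, which would explain the (n, 2) shapes; the sketch just shows what the shapes look like when the target is a single column.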