I've used train_test_split() numerous times with index slicing, but for some reason it's retaining the predictor values for both y train and test sets. Below is example data, along with train/test slicing and shapes.

Original data example:

    nypd_dummy.head(3)

      borough   status
start 
2016  BRONX     ATTEMPTED
2017  BROOKLYN  ATTEMPTED
2018  BRONX     COMPLETED

Dummy-encoded data:

    nypd_dummies = pd.get_dummies(nypd_dummy)
    nypd_dummies.head(3)

          borough_BRONX borough_BROOKLYN status_ATTEMPTED status_COMPLETED
start     
2016      1             0                1                0
2017      0             1                1                0                
2018      1             0                0                1

    from sklearn.model_selection import train_test_split

    # predictors: the two status_* columns; target: the two borough_* columns
    X_dummies = nypd_dummies.iloc[:, 2:]
    y_dummies = nypd_dummies.iloc[:, :2]
    xtrain_dummy, xtest_dummy, ytrain_dummy, ytest_dummy = train_test_split(X_dummies, y_dummies, test_size=0.3)

    print('x train:', xtrain_dummy.shape, 'x test:', xtest_dummy.shape)
    print('y train:', ytrain_dummy.shape, 'y test:', ytest_dummy.shape)

x train: (3, 2) x test: (1, 2)
y train: (3, 2) y test: (1, 2)

Ultimately I'm aiming to create a model that predicts the borough. Is it not slicing correctly because I'm pulling the target values from multiple columns rather than from a single output column?

Mr. Jibz
  • Can you provide a small `.txt` sample that one can load with pandas, then explain clearly what output you get and what output you expected? To me, the x/y train/test shapes seem correct, but I might have misunderstood, hence the request for clarification. – IMCoins Aug 22 '18 at 16:16
  • I added the original code example; the file derives from CSV format. – Mr. Jibz Aug 22 '18 at 16:27

1 Answer

Your code produces y DataFrames with two one-hot columns, one per borough (for both train and test):

       borough_BRONX  borough_BROOKLYN
start                                 
2018               1                 0

Since you want a single prediction (the 'borough' class), you probably want a single label column rather than one column per borough. Here is a tutorial on dealing with categorical data in pandas.
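A minimal sketch of what a single label column could look like, assuming the original `nypd_dummy` frame is still available. The four rows below are made up for illustration (the question only shows three), and `random_state=0` is an arbitrary choice to make the split repeatable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the question's data; the rows are invented for illustration
nypd_dummy = pd.DataFrame(
    {
        "borough": ["BRONX", "BROOKLYN", "BRONX", "BROOKLYN"],
        "status": ["ATTEMPTED", "ATTEMPTED", "COMPLETED", "COMPLETED"],
    },
    index=pd.Index([2016, 2017, 2018, 2019], name="start"),
)

# One-hot encode only the predictor; keep the target as one column of labels
X = pd.get_dummies(nypd_dummy[["status"]])
y = nypd_dummy["borough"]

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

# X keeps its two status_* columns, while y is now a one-dimensional Series
print("x train:", xtrain.shape, "x test:", xtest.shape)
print("y train:", ytrain.shape, "y test:", ytest.shape)
```

With this layout the y shapes are one-dimensional, which is what most scikit-learn classifiers expect for a single multi-class target.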

zohar.kom