I have a dataframe with 14 columns in total; the last column is the target label, with integer values 0 or 1.

I have defined:

  1. X = df.iloc[:,0:13] ---- the feature values
  2. y = df.iloc[:,-1] ------ the corresponding labels

Both have the same length, as desired: X is a dataframe with 13 columns and shape (159880, 13), and y is one-dimensional with shape (159880,).

But when I call train_test_split() on X and y, the function does not behave as I expect.

Below is the straightforward code:

X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)

After this split, both X_train and X_test have shape (119910, 13), y_train has shape (39970, 13), and y_test has shape (39970,).
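The 13-column y_train can be reproduced on synthetic data. The following is only a minimal sketch: the random 1000-row frame and the column names f0..f12 are assumptions standing in for train.csv, and the default test_size of 0.25 is used as in the call above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 13 feature columns plus a 0/1 target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 13)), columns=[f"f{i}" for i in range(13)])
df["compliance"] = rng.integers(0, 2, size=1000)

X = df.iloc[:, 0:13]  # 13 feature columns
y = df.iloc[:, -1]    # target column

# Same unpacking order as in the question.
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0)

# The name y_train is bound to the second return value, which is a slice of X.
print(y_train.shape)  # (250, 13)
print(X_test.shape)   # (750,)
```

In this sketch the name y_train ends up holding feature rows and X_test ends up holding labels, which suggests the names on the left-hand side do not match the order of the returned values.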

This is weird: even after setting the test_size parameter, the results stay the same.

Please advise what could be going wrong.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_feature_importances
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def model():
    
    df = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    df = df[np.isfinite(df['compliance'])]
    df = df.fillna(0)
    df['compliance'] = df['compliance'].astype('int')
    df = df.drop(['grafitti_status', 'violation_street_number','violation_street_name','violator_name',
                  'inspector_name','mailing_address_str_name','mailing_address_str_number','payment_status',
                  'compliance_detail', 'collection_status','payment_date','disposition','violation_description',
                  'hearing_date','ticket_issued_date','mailing_address_str_name','city','state','country',
                  'violation_street_name','agency_name','violation_code'], axis=1)
    df['violation_zip_code'] = df['violation_zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['zip_code'] = df['zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['non_us_str_code'] = df['non_us_str_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['violation_zip_code'] = pd.to_numeric(df['violation_zip_code'], errors='coerce')
    df['zip_code'] = pd.to_numeric(df['zip_code'], errors='coerce')
    df['non_us_str_code'] = pd.to_numeric(df['non_us_str_code'], errors='coerce')
    #df.violation_zip_code = df.violation_zip_code.replace('-','', inplace=True)
    df['violation_zip_code'] = np.nan_to_num(df['violation_zip_code'])
    df['zip_code'] = np.nan_to_num(df['zip_code'])
    df['non_us_str_code'] = np.nan_to_num(df['non_us_str_code'])
    X = df.iloc[:,0:13]
    y = df.iloc[:,-1]
    X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)    
    print(y_train.shape)
Mario
Nakul Sharma

2 Answers

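You have mixed up the order of the values returned by train_test_split. For each input array it returns the train part followed by the test part, so passing (X, y) gives back X_train, X_test, y_train, y_test, in that order. It should be:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)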
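To check the corrected call, here is a minimal sketch on synthetic data. The random 1000-row frame and the column names f0..f12 are assumptions used only for illustration; they are not the question's train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 13 feature columns plus a 0/1 target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 13)), columns=[f"f{i}" for i in range(13)])
df["compliance"] = rng.integers(0, 2, size=1000)

X = df.iloc[:, 0:13]  # features
y = df.iloc[:, -1]    # labels

# Train/test pair for X first, then train/test pair for y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (800, 13) (200, 13)
print(y_train.shape, y_test.shape)  # (800,) (200,)
```

With this order, the feature splits keep their 13 columns and the label splits stay one-dimensional, and the row indices of X_train and y_train line up.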
stephen_mugisha
Gambit1614
if args.mode == "train":

    # Load Data
    data, labels = load_dataset('C:/Users/PC/Desktop/train/k')

    # Train ML models
    knn(data, labels,'C:/Users/PC/Desktop/train/knn.pkl' )
Muhammad Dyas Yaskur
  • Your answer is unclear. Please [edit](https://stackoverflow.com/posts/62238326/edit) to add additional details that will help others understand how this addresses the question asked by OP. You can find more information on how to write [good answers](https://stackoverflow.com/help/how-to-answer) in the help center. – Mario Feb 21 '22 at 07:58