Why is my y_pred model only close to zero?

Question

I am new to Python and am also learning machine learning. I have a Titanic dataset and am trying to predict who survived and who did not. But my code seems to have a problem with `y_pred`: none of the predictions is close to 1 or above. The `y_test` and `y_pred` images are attached.

    # Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    # Importing the dataset
    dataset = pd.read_csv('train.csv')
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 3].values

    # Taking care of missing data
    from sklearn.preprocessing import Imputer
    imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
    imputer = imputer.fit(X[:, 2:3])
    X[:, 2:3] = imputer.transform(X[:, 2:3])

    #Encoding Categorical variable
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
    onehotencoder = OneHotEncoder(categorical_features = [0])
    X = onehotencoder.fit_transform(X).toarray()

    # Dummy variable trap
    X = X[:, 1:]

    # Splitting the Dataset into Training Set and Test Set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

    # Split the dataset into training and test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_tratin, y_test = train_test_split(X, y, test_size = 0.2,)

    # Fitting the Multiple Linear Regression to the training set
    """ regressor is an object of LinearRegression() class in line 36 """
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

[image: y_pred]

It would be helpful to see this data-set to determine what you are trying to do and what may be going wrong. — px06, Apr 09 '18 at 17:17
Also, is there any reason you are splitting with `train_test_split` twice? FYI, it looks like you also have a typo in the second instance: `y_tratin` should be `y_train`. — px06, Apr 09 '18 at 17:20
If you are trying to predict who survived or not, shouldn't that be a `classification` problem instead of a `regression` problem? I mean, why not use a model that returns 0 or 1 (such as trees) instead of a continuous number? — rafaelc, Apr 09 '18 at 17:23
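
As a rough sketch of what the last two comments suggest (split once, then fit a classifier), the following uses synthetic data from `make_classification` as a stand-in, since the actual dataset is not shown in the thread. `DecisionTreeClassifier` is only one possible choice of model; it is an assumption here, not something the asker used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: stands in for train.csv)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split exactly once so X_train and y_train hold the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A classifier predicts class labels (0 or 1), not a continuous number
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(np.unique(y_pred))
```

Every value in `y_pred` is then a class label, so it can be compared with `y_test` directly.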

score 0 · Answer 1 · answered Apr 09 '18 at 20:20

Thanks for the help, everyone. I have been able to sort it out. The problem was that `y`, as read from the dataset, was a vector and not a matrix.

    # Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    # Importing the dataset
    dataset = pd.read_csv('train.csv')
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 3:].values

    # Taking care of missing data
    from sklearn.preprocessing import Imputer
    imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
    imputer = imputer.fit(X[:, 2:3])
    X[:, 2:3] = imputer.transform(X[:, 2:3])

    #Encoding Categorical variable
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
    onehotencoder = OneHotEncoder(categorical_features = [0])
    X = onehotencoder.fit_transform(X).toarray()

    # Dummy variable trap
    X = X[:, 1:]

    # Splitting the Dataset into Training Set and Test Set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

    # Fitting the Multiple Linear Regression to the training set
    """ regressor is an object of LinearRegression() class in line 36 """
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    # Predicting the test set result
    y_pred = regressor.predict(X_test)
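
To illustrate the vector-versus-matrix difference described in this answer, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions, since the real `train.csv` is not shown in the thread. An integer column index returns a 1-D array, while a slice returns a 2-D array:

```python
import pandas as pd

# Hypothetical stand-in for train.csv (the real columns are not shown)
dataset = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "target": [0, 1, 0],
})

y_vector = dataset.iloc[:, 3].values   # integer index: 1-D array
y_matrix = dataset.iloc[:, 3:].values  # slice: 2-D array with one column

print(y_vector.shape)  # (3,)
print(y_matrix.shape)  # (3, 1)
```

`LinearRegression.fit` accepts a target of either shape, so if the predictions still look off, it may also be worth checking that `X_train` and `y_train` come from the same `train_test_split` call.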
