Anomaly in test data-set for carrying out multivariate regression in python

Question

I have a dataset (train, test and result) which consists of 32 Independent Variables and 5 Dependent Variables. To get a grasp of the data, I am trying to build a simple linear regression model on it and test its performance.

But all the rows of dependent variables in the test data set are filled with "?".

Salary  DOJ  DOL  Designation  JobCity
?       ?    ?    ?            ?
?       ?    ?    ?            ?
?       ?    ?    ?            ?
?       ?    ?    ?            ?
?       ?    ?    ?            ?
?       ?    ?    ?            ?

Out of these 5 dependent variables, I am trying to predict the Salary (with the help of 32 independent variables).

When I implement this code:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression

def get_data(file_name):
    # Load the training sheet from the path passed in
    train = pd.read_excel(file_name)
    X_train = train.drop(['ID', 'Salary'], axis=1)
    # Keep only the numeric columns
    X_train = X_train.select_dtypes(include=[np.number])
    y_train = train.Salary
    return y_train, X_train

Y, X = get_data("C:/Users/Shubhanshu/Desktop/train.xlsx")

clf = LinearRegression()
clf = clf.fit(X, Y)

def get_test_data(file_name):
    # Load the test sheet from the path passed in
    test = pd.read_excel(file_name)
    X_test = test.drop(['ID', 'Salary'], axis=1)
    # Keep only the numeric columns
    X_test = X_test.select_dtypes(include=[np.number])
    y_test = test.Salary
    return X_test, y_test

X1, Y1 = get_test_data("C:/Users/Shubhanshu/Desktop/test.xlsx")
r_sqr = clf.score(X1, Y1)
y_pred = clf.predict(X1)

I get the error: ValueError: could not convert string to float: ?

After searching through the available material and tutorials, my guess is that I could replace every '?' with a scalar value such as 0. But won't that affect the model? (I am sorry, this is my first hands-on project with ML and a proper dataset, so please forgive my ignorance in the matter.) If that is the wrong approach, how should I proceed in such a case?

P.S: I tried this:

test = pd.read_excel("C:/Users/Shubhanshu/Desktop/test.xlsx")
X_test = test.drop(['ID', 'Salary'], axis=1)
#Keeping only numeric data

#test.dtypes
test.Salary = test.Salary.fillna('missing')
test.DOJ = test.DOJ.fillna('missing')
test.DOL = test.DOL.fillna('missing')
test.Designation = test.Designation.fillna('missing')
test.JobCity = test.JobCity.fillna('missing')
#test = test.Salary.fillna('missing')
y_test = test.Salary
#test.info
X_test.DOJ = test.DOJ.fillna('missing')
X_test.DOL = test.DOL.fillna('missing')
X_test.Designation = test.Designation.fillna('missing')
X_test.JobCity = test.JobCity.fillna('missing')
X_test = X_test._get_numeric_data()
#X_test
X_test = X_test.convert_objects(convert_numeric=True)
y_pred = clf.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print mse

But obviously, since every cell of y_test now contains "missing", neither mse nor mae can return anything meaningful. This is what I got in return:

ValueError                                Traceback (most recent call last)
<ipython-input-97-b384f3be69ab> in <module>()
----> 1 mae = mean_absolute_error(y_test, y_pred)
      2 mse = mean_squared_error(y_test, y_pred)
      3 print mse

F:\PythonIDE\lib\site-packages\sklearn\metrics\regression.pyc in mean_absolute_error(y_true, y_pred, sample_weight)
    137 
    138     """
--> 139     y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
    140     return np.average(np.abs(y_pred - y_true).mean(axis=1),
    141                       weights=sample_weight)

F:\PythonIDE\lib\site-packages\sklearn\metrics\regression.pyc in _check_reg_targets(y_true, y_pred)
     56     """
     57     check_consistent_length(y_true, y_pred)
---> 58     y_true = check_array(y_true, ensure_2d=False)
     59     y_pred = check_array(y_pred, ensure_2d=False)
     60 

F:\PythonIDE\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
    342             else:
    343                 dtype = None
--> 344         array = np.array(array, dtype=dtype, order=order, copy=copy)
    345         # make sure we actually converted to numeric:
    346         if dtype_numeric and array.dtype.kind == "O":

ValueError: could not convert string to float: missing

I would be really grateful for any help in this regard.

score 0 · Answer 1 · answered Dec 23 '15 at 00:08

In future could you please include the full traceback in your question rather than just the last line? I suspect that the problem is here:

r_sqr = clf.score(X1, Y1)

Here, Y1 represents the dependent variables for the test dataset, which just consist of "?" nonsense values.
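
One way to confirm this is to coerce the target column to numbers and count how many usable values remain. The snippet below is only a sketch: the small DataFrame is a made-up stand-in for the real test sheet, and `feature_a` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical stand-in for the test sheet: the target column holds only "?".
test = pd.DataFrame({
    "ID": [1, 2, 3],
    "feature_a": [84.3, 85.4, 85.0],  # hypothetical numeric feature
    "Salary": ["?", "?", "?"],
})

# Coerce the target to numbers; anything that cannot be parsed becomes NaN.
salary = pd.to_numeric(test["Salary"], errors="coerce")

# Count the rows that still carry a real Salary value.
n_known = int(salary.notna().sum())
print("rows with a usable Salary:", n_known)
```

If the count is zero, as it is for this stand-in data, there is nothing to score against.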

I think there is a conceptual misunderstanding here - what you're doing just doesn't make sense. There's no way you could possibly compute a score for the test dataset, since you don't know the true values of the dependent variables.
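
If the goal is an estimate of how well the model performs, a common approach is to hold back part of the labelled training data for validation and use the unlabelled test set only for prediction. The following is a minimal sketch on synthetic data, not the asker's dataset: the arrays, the 25% split and the coefficients are all assumptions for illustration. It imports `train_test_split` from `sklearn.model_selection`, which is where recent scikit-learn releases keep it; older releases exposed it under `sklearn.cross_validation`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic labelled data standing in for the training sheet (assumption).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X.dot(np.array([2.0, -1.0, 0.5])) + 0.1 * rng.randn(200)

# Hold back a quarter of the labelled rows purely for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_fit, y_fit)

# These scores are meaningful because y_val contains real target values.
r2_val = model.score(X_val, y_val)
mae_val = mean_absolute_error(y_val, model.predict(X_val))
print("validation R^2:", r2_val)
print("validation MAE:", mae_val)

# Synthetic unlabelled rows standing in for the test sheet (assumption).
# Predictions can be made, but no score can be computed for them.
X_unlabeled = rng.rand(5, 3)
y_pred = model.predict(X_unlabeled)
print(y_pred)
```

The validation scores are only an estimate of performance on unseen data, but they are the best available when the test targets are withheld.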

ali_m

I have added the traceback. I agree that such nonsense values in the test data make no sense, but this is what the "test" dataset contains (it was provided separately from the training data). – Shubhanshu Dec 26 '15 at 07:29
I'm not sure what else I can say - computing a mean squared error score without having real values for the dependent variable is impossible (although `clf.predict(X1)` should work fine). – ali_m Dec 26 '15 at 09:43
