I have a dataset (train, test and result files) consisting of 32 independent variables and 5 dependent variables. To get a feel for the data, I am trying to fit a simple linear regression model on it and test its performance.
However, every row of the dependent variables in the test set is filled with "?":
Salary DOJ DOL Designation JobCity
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
? ? ? ? ?
Of these 5 dependent variables, I am only trying to predict Salary, using the 32 independent variables.
When I run this code:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
def get_data(file_name):
    # Read from the path passed in rather than a hard-coded one
    train = pd.read_excel(file_name)
    X_train = train.drop(['ID', 'Salary'], axis=1)
    # Keeping only numeric data
    X_train = X_train._get_numeric_data()
    y_train = train.Salary
    return y_train, X_train
Y, X = get_data("C:/Users/Shubhanshu/Desktop/train.xlsx")
clf = LinearRegression()
clf = clf.fit(X, Y)
def get_test_data(file_name):
    # Read from the path passed in rather than a hard-coded one
    test = pd.read_excel(file_name)
    X_test = test.drop(['ID', 'Salary'], axis=1)
    # Keeping only numeric data
    X_test = X_test._get_numeric_data()
    y_test = test.Salary
    return X_test, y_test
X1, Y1 = get_test_data("C:/Users/Shubhanshu/Desktop/test.xlsx")
r_sqr = clf.score(X1, Y1)
y_pred = clf.predict(X1)
I get the error: ValueError: could not convert string to float: ?
After going through the available material and tutorials, my guess is that I could replace every '?' with a scalar value such as 0. But won't that affect the model? (I am sorry, this is my first hands-on project with ML and a proper dataset, so please forgive my ignorance in the matter.) If that is not the right approach, how should I proceed in such a case?
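For illustration, a minimal sketch of the alternative to filling with 0: marking "?" as missing, so that it is obvious the test sheet carries no usable Salary values. The small DataFrame stands in for test.xlsx, its values are invented, and `collegeGPA` is a hypothetical feature name; Python 3 and a recent pandas are assumed. When reading the real file, passing `na_values="?"` to `pd.read_excel` should have a similar effect.

```python
import pandas as pd

# Toy stand-in for the test sheet (invented values; collegeGPA is a hypothetical feature).
test = pd.DataFrame({
    "ID": [1, 2, 3],
    "collegeGPA": [7.5, 8.1, 6.9],
    "Salary": ["?", "?", "?"],
    "DOJ": ["?", "?", "?"],
    "DOL": ["?", "?", "?"],
    "Designation": ["?", "?", "?"],
    "JobCity": ["?", "?", "?"],
})

target_cols = ["Salary", "DOJ", "DOL", "Designation", "JobCity"]

# Coerce the placeholder to NaN instead of substituting an arbitrary number such as 0.
test["Salary"] = pd.to_numeric(test["Salary"], errors="coerce")

# Features: everything except the ID and the five unknown target columns, numeric only.
X_test = test.drop(columns=["ID"] + target_cols).select_dtypes(include="number")

# True only if at least one test row has a real Salary to score against.
has_labels = bool(test["Salary"].notna().any())
print(has_labels)
```

If `has_labels` comes out false, the test sheet can still be passed to `clf.predict`, but there is nothing in it to compute `score`, MAE or MSE against.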
P.S: I tried this:
test = pd.read_excel("C:/Users/Shubhanshu/Desktop/test.xlsx")
X_test = test.drop(['ID', 'Salary'], axis=1)
#Keeping only numeric data
#test.dtypes
test.Salary = test.Salary.fillna('missing')
test.DOJ = test.DOJ.fillna('missing')
test.DOL = test.DOL.fillna('missing')
test.Designation = test.Designation.fillna('missing')
test.JobCity = test.JobCity.fillna('missing')
#test = test.Salary.fillna('missing')
y_test = test.Salary
#test.info
X_test.DOJ = test.DOJ.fillna('missing')
X_test.DOL = test.DOL.fillna('missing')
X_test.Designation = test.Designation.fillna('missing')
X_test.JobCity = test.JobCity.fillna('missing')
X_test = X_test._get_numeric_data()
#X_test
X_test = X_test.convert_objects(convert_numeric=True)
y_pred = clf.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print mse
But obviously, since every cell of y_test now contains "missing", mse and mae cannot return anything meaningful. This is what I got in return:
ValueError                                Traceback (most recent call last)
<ipython-input-97-b384f3be69ab> in <module>()
----> 1 mae = mean_absolute_error(y_test, y_pred)
2 mse = mean_squared_error(y_test, y_pred)
3 print mse
F:\PythonIDE\lib\site-packages\sklearn\metrics\regression.pyc in mean_absolute_error(y_true, y_pred, sample_weight)
137
138 """
--> 139 y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
140 return np.average(np.abs(y_pred - y_true).mean(axis=1),
141 weights=sample_weight)
F:\PythonIDE\lib\site-packages\sklearn\metrics\regression.pyc in _check_reg_targets(y_true, y_pred)
56 """
57 check_consistent_length(y_true, y_pred)
---> 58 y_true = check_array(y_true, ensure_2d=False)
59 y_pred = check_array(y_pred, ensure_2d=False)
60
F:\PythonIDE\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
342 else:
343 dtype = None
--> 344 array = np.array(array, dtype=dtype, order=order, copy=copy)
345 # make sure we actually converted to numeric:
346 if dtype_numeric and array.dtype.kind == "O":
ValueError: could not convert string to float: missing
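Since the metrics need numeric true values, one way to still measure performance is to hold out part of the training data, where Salary is known, and score the model there. The sketch below uses synthetic numbers in place of train.xlsx (the data and coefficients are invented) and assumes Python 3 with a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numeric training features and Salary (invented values).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep 25% of the labelled rows aside purely for evaluation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LinearRegression().fit(X_fit, y_fit)
y_val_pred = clf.predict(X_val)

# Both arrays are numeric here, so the metrics can be computed.
mae = mean_absolute_error(y_val, y_val_pred)
mse = mean_squared_error(y_val, y_val_pred)
r_sqr = clf.score(X_val, y_val)
print(mae, mse, r_sqr)
```

The unlabelled test sheet would then only be used for `clf.predict`, with the predictions written out rather than scored.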
I would be really grateful for any help in this regard.