I've followed Andrew Ng's machine learning course and tried to reproduce some of the examples in Python with scikit-learn.
I'm trying to understand the effect of the regularization parameter C (in scikit-learn, C is the inverse of the regularization strength, so roughly 1/λ in Ng's notation). The problem I keep running into is easiest to demonstrate with the following loop:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

for c in range(1, 10):
    c = c / 10.0
    print(c)
    classifier = LogisticRegression(C=c, max_iter=10000)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
I'm expecting to see higher accuracy with lower values of C. However, the accuracies I get are all identical:
0.1 .. 0.9
[0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202]
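One sanity check I could add (my own diagnostic, not something from the course) is printing the norm of the fitted coefficients alongside the accuracy, since coef_ should shrink as C decreases even when the predictions don't change:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# X_train, y_train, X_test, y_test as prepared in the full code below.
for c in (0.001, 0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=c, max_iter=10000)
    clf.fit(X_train, y_train)
    # Smaller C means stronger L2 regularization, so the coefficient norm
    # should drop as C decreases, even if test accuracy stays flat.
    print(c, np.linalg.norm(clf.coef_),
          metrics.accuracy_score(y_test, clf.predict(X_test)))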
Running this outside the loop, with a much smaller C:
classifier = LogisticRegression(C=0.0001, max_iter=10000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
results in 0.77653631284916202 as well.
A kernel restart seems to be the only way to "reset" the classifier. After restarting and reloading the data, running the code above gives the expected higher value: 0.8044692737430168.
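I notice my train_test_split call below doesn't fix random_state, so one thing I could try is pinning the split to rule out run-to-run randomness in the data; a minimal sketch, assuming the standard train_test_split signature:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Fixing random_state makes the split identical across runs and kernel
# restarts, so any remaining variation must come from the classifier itself.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)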
Is this expected behaviour? Or am I abusing Python/scikit-learn?
I'm on macOS 10.13.2. I've tried this in IPython directly from the terminal, as well as in (Anaconda) Spyder and the (Anaconda) Jupyter Notebook (Anaconda2-5.0.1 and Anaconda3-5.0.1). All packages are up to date.
Edit: Upon request, the complete code is below. train.csv can be downloaded from the Kaggle Titanic dataset.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('train.csv')
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
dataset['Sex'] = le.fit_transform(dataset['Sex'])  # female/male -> 0/1
le2 = preprocessing.LabelEncoder()
dataset['Embarked'] = le2.fit_transform(dataset['Embarked'].astype(str))  # C/Q/S/nan -> integers
# remove NaN in the Age column by mean imputation
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(dataset['Age'].values.reshape(-1, 1))
dataset['Age'] = imputer.transform(dataset['Age'].values.reshape(-1, 1))
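(Aside: preprocessing.Imputer has since been deprecated; on newer scikit-learn versions the equivalent, as far as I understand, would be sklearn.impute.SimpleImputer, roughly like this:)

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaced the old preprocessing.Imputer class.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset['Age'] = imputer.fit_transform(dataset[['Age']]).ravel()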
y = dataset.iloc[:, 1].values                 # Survived
X = dataset.iloc[:, [2, 4, 5, 6, 7, 9, 11]]   # Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
X[0:5]  # quick look at the first rows
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # apply the training statistics to the test data
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn import metrics
for c in range(1, 10):
    c = c / 10.0
    print(c)
    classifier = LogisticRegression(C=c, random_state=42)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
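For completeness, a variant I could use to get a more stable accuracy estimate per C, scoring with cross-validation instead of the single held-out split (cross_val_score is standard scikit-learn; the 5-fold choice is arbitrary):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

for c in (0.1, 0.3, 0.5, 0.7, 0.9):
    classifier = LogisticRegression(C=c, random_state=42)
    # Averaging over 5 folds is less sensitive to a single lucky/unlucky split.
    scores = cross_val_score(classifier, X_train, y_train, cv=5, scoring='accuracy')
    print(c, scores.mean())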