I've followed Andrew Ng's machine learning course and tried to reproduce some of the examples in Python with scikit-learn.

I'm trying to understand the effect of the regularization parameter C. The problem I keep running into is easiest to show with the following:

for c in range(1,10):
    c = c/10.
    print( c )
    classifier = LogisticRegression(C=c , max_iter=10000)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))

I'm expecting the accuracy to change with C, and to be higher for lower values. However, I get exactly the same result for every value:

0.1 .. 0.9
[0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202]

Running this outside the loop:

classifier = LogisticRegression(C=0.0001,max_iter=10000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

results in 0.77653631284916202 as well.

Kernel restart seems to be the only way to "reset" the classifier. After restart and loading of the data, running the code above gives the expected higher value: 0.8044692737430168.

Is this expected behaviour? Or am I abusing Python/scikit-learn?

I'm on OS X 10.13.2. I tried this in IPython directly from the terminal as well as in (Anaconda) Spyder and (Anaconda) Jupyter Notebook (Anaconda2-5.0.1 and Anaconda3-5.0.1). All packages are up to date.

Edit: Upon request, the complete code is below. train.csv can be downloaded from the Kaggle Titanic dataset.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('train.csv')

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
dataset['Sex'] = le.fit_transform(dataset['Sex'])
le2 = preprocessing.LabelEncoder()
dataset['Embarked'] = le2.fit_transform(dataset['Embarked'].astype(str) )

# replace NaN in the Age column with the column mean
imputer = preprocessing.Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(dataset['Age'].reshape(-1, 1))
dataset['Age'] = imputer.transform(dataset['Age'].reshape(-1, 1))

y = dataset.iloc[:, 1].values
X = dataset.iloc[:, [2,4,5,6,7,9,11]]
X[0:5]

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn import metrics
for c in range(1,10):
    c = c/10.
    print( c )
    classifier = LogisticRegression(C=c , random_state = 42)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
  • And you assume this does not depend on X_train/y_train? – sascha Dec 23 '17 at 10:02
  • Please elaborate. The dataset is good; I've used the Titanic dataset here. When running the loop manually (restarting the kernel, setting C, running the code, restarting the kernel, setting C, running the code...), I'm seeing different values for accuracy. – user45237841 Dec 23 '17 at 10:16
  • Add reproducible code. Make sure you understand your params (missing random_state; unused max_iter). In its current form, I don't even get the question. – sascha Dec 23 '17 at 10:17
  • train_test_split picked a different split after each kernel restart, which caused the different results. Thanks for the help! – user45237841 Dec 23 '17 at 10:37
  • Small addition. Your train and test sets may be unbalanced. You should split the data so that both sets have the same proportions of class 0 and class 1. Try using GridSearchCV with cv set to a StratifiedKFold. – avchauzov Dec 24 '17 at 11:22
  • Excellent point, thank you! – user45237841 Dec 25 '17 at 07:37
  • Maybe it's a very stupid question, but... *Why* do you expect more accuracy with a smaller regularization parameter (C)? You cannot reduce it indefinitely; as I understand it, with very low values your classifier will become just like a line (a line in several dimensions), won't it? – sergzach Dec 25 '17 at 20:56
  • Different values: probably because you do not seed the random generator with a fixed constant. – sergzach Dec 25 '17 at 21:04
  • @sergzach Exactly, for this specific function a low value of C corresponds to strong regularisation, so I was expecting to see some change in the result. However, I often didn't see any difference while varying the parameter. My test was flawed since I didn't fix the random seed of the split. – user45237841 Jan 17 '18 at 08:31
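
To make the points from the comments concrete, here is a minimal sketch of a reproducible version of the experiment. It is not the asker's exact code: the data is a synthetic stand-in generated with make_classification (an assumption, so the numbers will not match the Titanic results), and the names sweep_c, C_VALUES and the seed 42 are hypothetical choices for illustration. The first part fixes random_state in train_test_split and stratifies on y, so repeated runs see the same split. The second part follows the GridSearchCV plus StratifiedKFold suggestion, with the scaler inside a pipeline so it is refitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven Titanic features (assumption: NOT the real data).
X, y = make_classification(n_samples=891, n_features=7, n_informative=4,
                           random_state=0)

# Hypothetical grid of C values; smaller C means stronger regularisation.
C_VALUES = [0.0001, 0.01, 0.1, 1.0, 10.0]


def sweep_c(split_seed):
    """Return the test accuracy for each C on a split fixed by split_seed."""
    # random_state pins the split; stratify=y keeps class proportions equal
    # in the training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=split_seed, stratify=y)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    scores = []
    for c in C_VALUES:
        classifier = LogisticRegression(C=c, max_iter=10000)
        classifier.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, classifier.predict(X_test)))
    return scores


# Two runs with the same seed use the same split, so the scores can be compared.
first = sweep_c(split_seed=42)
second = sweep_c(split_seed=42)
print(first == second)
for c, score in zip(C_VALUES, first):
    print(c, score)

# Cross-validated alternative: stratified folds instead of a single split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
search = GridSearchCV(pipeline, {"logisticregression__C": C_VALUES},
                      cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With the split pinned, any difference between runs comes from C rather than from which rows ended up in the test set. Whether accuracy actually changes with C still depends on the data; on a small, well-scaled problem several values of C can give the same test accuracy.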
