I want to perform cross-validation for logistic regression, using the array arr returned by my load_data function as input. An outline of my code is below. The function runs but does not produce any output.

import pandas as pd
import numpy as np
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn import cross_validation

def load_data(filename):
    df = pd.read_csv(filename)
    arr = df.values
    print arr[:3]
    return arr
# load_data("data.csv")

def fit_logistic_cv(arr, cv=5):
    X=arr[:, :-1]
    y=arr[:, -1]
    print y
    kf_total = cross_validation.KFold(len(X), n_folds=cv) # (indices=True, shuffle=True, random_state=4)
    lr = LogisticRegression()  # use the class imported above; the linear_model module itself is never imported
    lr.fit(X,y)
    precisions=cross_validation.cross_val_score(lr, X, y, cv=kf_total, scoring='precision')
    print 'Precision', np.mean(precisions), precisions
    recalls=cross_validation.cross_val_score(lr, X, y, cv=kf_total, scoring='recall')
    print 'Recalls', np.mean(recalls), recalls
    f1s = cross_validation.cross_val_score(lr, X, y, cv=kf_total, scoring='f1')
    print 'F1', np.mean(f1s), f1s


def test_logistic_cv():  # testing above function 
    data_filename = "data.csv"
    fit_logistic_cv(load_data(data_filename))
Alph
  • It's unclear why you extracted a numpy array from the pandas df; pandas dfs are compatible with sklearn methods. You just index the columns and pass them as params, e.g. `classifier.fit(df['X_train_vals'], df['y_train_vals'])`. This is just an indicative example, as I don't know what your columns actually are. There is plenty of sample code out on the interweb for this. – EdChum Feb 24 '15 at 15:56
  • @EdChum, this is the required way to do it. I am having trouble getting X_train and y_train from the numpy array. – Alph Feb 24 '15 at 16:04
  • You're going to have to explain better; please edit the errors into your question, plus any additional code. – EdChum Feb 24 '15 at 16:05
  • @EdChum, I used k-fold in the cross-validation, built from arr (via len(arr)). I wonder if that is correct. – Alph Feb 24 '15 at 16:31
  • It'll still work; what's returned are the indices to use to slice the df row-wise. I think you need to persist with dataframes more, because at this stage you have a lot of basic questions and errors which are not that useful to answer here. – EdChum Feb 24 '15 at 16:38
  • The real question is what df contains, and what it is you want to predict. – Andreas Mueller Feb 25 '15 at 23:40
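
For what it's worth, here is a minimal sketch of the same flow against a current scikit-learn, where `sklearn.cross_validation` has since been replaced by `sklearn.model_selection` and `KFold` takes `n_splits` instead of a length and `n_folds`. The assumptions are loud ones: the last column is taken to be a binary 0/1 label, and a synthetic frame from `make_classification` (with made-up column names `f0`..`f3` and `label`) stands in for data.csv, whose contents are not shown. Note also that in the posted code nothing ever calls `test_logistic_cv()`, which on its own would be enough to explain the missing output.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score


def fit_logistic_cv(arr, cv=5):
    """Cross-validate logistic regression on an array whose last column is the label."""
    X = arr[:, :-1]
    y = arr[:, -1].astype(int)  # assumption: binary 0/1 label in the last column
    # Shuffle before splitting so a file sorted by label does not give single-class folds.
    kf = KFold(n_splits=cv, shuffle=True, random_state=4)
    lr = LogisticRegression(max_iter=1000)
    scores = {}
    for metric in ("precision", "recall", "f1"):
        # cross_val_score clones and refits lr on every fold, so no prior fit is needed.
        scores[metric] = cross_val_score(lr, X, y, cv=kf, scoring=metric)
        print(metric, np.mean(scores[metric]), scores[metric])
    return scores


# Synthetic stand-in for data.csv (hypothetical data, not the asker's file).
feature_cols = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
df = pd.DataFrame(X_demo, columns=feature_cols)
df["label"] = y_demo

# The function has to be called for anything to be printed.
scores = fit_logistic_cv(df.values)

# Same evaluation passing DataFrame columns directly, as suggested in the comments.
df_f1 = cross_val_score(
    LogisticRegression(max_iter=1000),
    df[feature_cols],
    df["label"],
    cv=KFold(n_splits=5, shuffle=True, random_state=4),
    scoring="f1",
)
```

Either route hands the same rows to each fold, so extracting a numpy array first is optional rather than required; which one reads better depends on whether the column names carry meaning in the real data.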

0 Answers