
We were given some code for a support vector machine and are supposed to implement leave-one-out cross-validation. If I understand it correctly, leave-one-out creates as many test sets as there are samples, which means that for a big data set the process will be costly and will most likely take quite a while to produce results.
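For example, on a tiny dummy array (not my real data), sklearn's `LeaveOneOut` does report one split per sample:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_toy = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
y_toy = np.array([0, 0, 1, 1, 1])

loo = LeaveOneOut()
print(loo.get_n_splits(X_toy))        # 5 -> one model fit per sample
for train_idx, test_idx in loo.split(X_toy, y_toy):
    print(train_idx, test_idx)        # each test set holds exactly one index
```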

I have tried to implement leave-one-out on the given SVM code, with only one iteration and 773 data points in total. I expected it to take some time, but two hours later the code is still running without any result, which makes me believe it might be stuck in some loop...

Does anyone have a suggestion as to what might be wrong? I'm not getting any error message either.

The entire code is as follows; the leave-one-out part is in the last function at the bottom (executed in a Jupyter notebook on an online Binder):

```python
import numpy as np   # needed for the np.* calls used throughout
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gseapy as gp
from gseapy.plot import gseaplot
import qvalue

from ipywidgets import interact, interact_manual
from ipywidgets import IntSlider, FloatSlider, Dropdown, Text

import sklearn as skl
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut
from sklearn import svm


interact_enrich=interact_manual.options(manual_name="Enrichment analysis")
interact_plot=interact_manual.options(manual_name="Plot")
interact_calc=interact_manual.options(manual_name="Calculate tests")

interact_gen=interact_manual.options(manual_name="Initialize data")
interact_SVM=interact_manual.options(manual_name="Train SVM")

clinical_data = pd.read_csv('../data/brca_clin.tsv.gz', sep ='\t', index_col=2)
clinical_data = clinical_data.iloc[4:,1:]
expression_data = pd.read_csv('../data/brca.tsv.gz', sep ='\t', index_col=1)
expression_data = expression_data.iloc[:,2:].T
def split_data(clinical_df, expression_df, separator, cond1, cond2):
    try:
        group1 = clinical_df[separator] == cond1
        index1 = clinical_df[group1].index
        group2 = clinical_df[separator] == cond2
        index2 = clinical_df[group2].index
    except:
        print('Clinical condition wrong')
    expression1 = expression_df.loc[index1].dropna()
    expression2 = expression_df.loc[index2].dropna()
    expression = pd.concat([expression1, expression2])
    X = expression.values
    y = np.append(np.repeat(0, len(expression1)), np.repeat(1, len(expression2)))
    display(pd.DataFrame([len(index1),len(index2)], columns = ['Number of points'], index = ['Group 1', 'Group 2']))
    return X, y

def plot_pca_variance(X, scale=False, ncomp = 1):
    if scale:
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
    pca = PCA()
    pca.fit(X)
    plt.rcParams["figure.figsize"] = (20,10)
    sns.set(style='darkgrid', context='talk')
    plt.plot(np.arange(1,len(pca.explained_variance_ratio_)+1),np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('Number of components')
    plt.ylabel('Cumulative explained variance')

    plt.vlines(ncomp, 0, plt.gca().get_ylim()[1], color='r', linestyles = 'dashed')
    h = np.cumsum(pca.explained_variance_ratio_)[ncomp -1]
    plt.hlines(h, 0, plt.gca().get_xlim()[1], color='r', linestyles = 'dashed')
    plt.title(str(ncomp) + ' components, ' + str(round(h, 3)) + ' variance explained')
    plt.show()

def reduce_data(X, n, scale=True):
    if scale:
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
    pca = PCA(n_components=n)
    Xr = pca.fit_transform(X)
    return Xr

def interact_split_data(Criteria, Group_1, Group_2):
    global BRCA_X, BRCA_y
    BRCA_X, BRCA_y = split_data(clinical_data, expression_data, Criteria, Group_1, Group_2)


def interact_SVM_1(Rescale, Max_iterations):
    max_iter = int(Max_iterations)
    loo = LeaveOneOut()
    ac_matrix_train, ac_matrix_test = np.array([]), np.array([]) 
    for train_id, test_id in loo.split(BRCA_X, BRCA_y):
        X_train, X_test, y_train, y_test = BRCA_X[train_id,:], BRCA_X[test_id,:], BRCA_y[train_id],BRCA_y[test_id]
        clf = svm.LinearSVC(C=0.1,max_iter=100000).fit(X_train, y_train) # Train an SVM
        y_train_pred = clf.predict(X_train)
    ac_matrix_train = confusion_matrix(y_train, y_train_pred)
    y_test_pred = clf.predict(X_test)
    ac_matrix_test = confusion_matrix(y_test, y_test_pred)
    display(pd.DataFrame(np.concatenate((ac_matrix_train,ac_matrix_test), axis =1), columns = ["predicted G1 (training)","predicted G2 (training)", "predicted G1 (test)","predicted G2 (test)"],index=["actual G1","actual G2"]))


interact_gen(interact_split_data, Criteria=Text('PR status by ihc'), Group_1 = Text('Positive'), Group_2=Text('Negative'))
interact_SVM(interact_SVM_1, Rescale = False, Max_iterations = Text('1'))
```
  • Might I recommend plotting without the interactions on, as a means of troubleshooting? There might be interaction on the backend... – Yaakov Bressler Dec 01 '19 at 01:05
  • Does this answer your question? [Leave-one-out cross-validation](https://stackoverflow.com/questions/24890684/leave-one-out-cross-validation) – Yaakov Bressler Dec 01 '19 at 01:06
  • A bit late, but for anybody else who stumbles across this: the problem was that only one of the data points was considered. The solution was to use a for loop so that the entire dataset is included. – Norruas Apr 02 '20 at 18:01

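The last comment above describes the eventual fix. As a minimal sketch (not the original assignment code), assuming the same `BRCA_X`/`BRCA_y` arrays and `LinearSVC` settings as in the question, each held-out prediction could be collected inside the loop and a single test confusion matrix built over all folds:

```python
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut

def loo_test_confusion(X, y):
    """Leave-one-out CV: fit on n-1 samples, predict the single held-out
    sample, and pool every held-out prediction into one confusion matrix."""
    loo = LeaveOneOut()
    y_true, y_pred = [], []
    for train_id, test_id in loo.split(X, y):
        clf = svm.LinearSVC(C=0.1, max_iter=100000).fit(X[train_id], y[train_id])
        y_true.append(y[test_id][0])                # the one held-out label
        y_pred.append(clf.predict(X[test_id])[0])   # its prediction
    return confusion_matrix(y_true, y_pred)

# usage (hypothetical): print(loo_test_confusion(BRCA_X, BRCA_y))
```

Since each leave-one-out fold has exactly one test sample, the results are summarised by pooling the held-out predictions like this rather than taking the confusion matrix of a single fold. With 773 samples this still trains 773 separate SVMs, so it will take a while either way.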