I am fitting a logistic regression with elastic net regularization to find out which variables are positively or negatively associated with the outcome. When I call accuracy_score(y_true, y_pred), I get the error "ValueError: Found input variables with inconsistent numbers of samples: [9076, 9075]". The data frame has 18151 observations. How can I fix the error? Could it be because train_test_split at 50% gives me one subsample with an odd number of rows and one with an even number?

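from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X2=df.iloc[:,23:41]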
y2=df["diab_inc"].values.reshape(-1,1)
X2_train,X2_test,y2_train,y2_test=train_test_split(X2,y2,test_size=0.5,random_state=1234)

print (len(X2_train),len(X2_test),len(y2_train),len(y2_test))
[9075 9076 9075 9076]

l1_ratio=(.001,.005,.01,.05,.1,.3,.5,.7,.9,1)
select=SelectFromModel(LogisticRegressionCV(cv=5, penalty='elasticnet', solver="saga", l1_ratios=l1_ratio, max_iter=10000)).fit(X2_train, y2_train)
print("Accuracy {0:2%}".format(accuracy_score(y2_test,select.estimator_.predict(X2_train))))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
----> 1 print("Accuracy {0:2%}".format(accuracy_score(y2_test,select.estimator_.predict(X2_train))))

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    200 
    201     # Compute accuracy for each possible representation
--> 202     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    203     check_consistent_length(y_true, y_pred, sample_weight)
    204     if y_type.startswith('multilabel'):

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     81     y_pred : array or indicator matrix
     82     """
---> 83     check_consistent_length(y_true, y_pred)
     84     type_true = type_of_target(y_true)
     85     type_pred = type_of_target(y_pred)

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    317     uniques = np.unique(lengths)
    318     if len(uniques) > 1:
--> 319         raise ValueError("Found input variables with inconsistent numbers of"
    320                          " samples: %r" % [int(l) for l in lengths])
    321 

ValueError: Found input variables with inconsistent numbers of samples: [9076, 9075]

1 Answer

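You want to predict on X2_test and compare those predictions with the ground truth y2_test. Your code currently predicts on the training data, X2_train. The train and test sets differ in size because your full dataset has an odd number of rows (18151) and you split it 50/50: with test_size=0.5, train_test_split rounds the test set up to 9076 rows and leaves 9075 for training. That is why accuracy_score receives 9076 true labels but 9075 predictions. The odd split itself is harmless as long as the predictions and the labels come from the same subset: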

accuracy_score(y2_test,select.estimator_.predict(X2_test))
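
To see the same behaviour without the original data, here is a minimal sketch on synthetic data. The array sizes, the variable names, and the plain LogisticRegression (swapped in for the elastic net LogisticRegressionCV to keep it quick) are assumptions for illustration, not part of your setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data: an odd number of rows (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(101, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=101) > 0).astype(int)  # 1-D target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1234
)

# With an odd row count the two halves differ by one row; the test share is rounded up.
split_sizes = (len(X_train), len(X_test))
print(split_sizes)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on the rows that y_test belongs to, so both arrays have the same length.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy {0:.2%}".format(test_accuracy))
```

Passing model.predict(X_train) together with y_test here would raise the same ValueError, because the two arrays would have 50 and 51 entries.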
mcsoini