Error when using a k-fold split for multilabel classification in sklearn

Question

I would like to do K-fold cross-validation. The code I had before adding K-fold cross-validation looks like this, and it works perfectly:

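import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv('finalupdatedothers-multilabel.csv')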

X= df[['sentences']]

dfy = df[['ADR','WD','EF','INF','SSI','DI','others']]
df1 = dfy.stack().reset_index()
df1.columns = ['a','b','c']
y_train_text = df1.groupby('a')['b'].apply(list)

lb = preprocessing.MultiLabelBinarizer()
# Run classifier
stop_words = stopwords.words('english')

classifier=make_pipeline(CountVectorizer(),
                  TfidfTransformer(),
                  #SelectKBest(chi2, k=4),
                  OneVsRestClassifier(SGDClassifier()))

#combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

random_state = np.random.RandomState(0)
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train_text, test_size=.2,
                                                    random_state=random_state)
print(y_train)
# # Binarize the output classes
Y = lb.fit_transform(y_train)
Y_test=lb.transform(y_test)
classifier.fit(X_train, Y)
# Reuse the fitted pipeline instead of fitting it a second time
y_score = classifier.decision_function(X_test)
print ("y_score"+str(y_score))
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)

#print accuracy_score
print ("accuracy : "+str(accuracy_score(Y_test, predicted)))

# average='weighted' is used below, so the labels say "weighted"
print ("weighted f-measure "+str(f1_score(Y_test, predicted, average='weighted')))

print("weighted precision "+str(precision_score(Y_test,predicted,average='weighted')))

print("weighted recall "+str(recall_score(Y_test,predicted,average='weighted')))

for item, labels in zip(X_test, all_labels):
    print ('%s => %s' % (item, ', '.join(labels)))

When I change the code to use k-fold cross-validation instead of train_test_split, I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 6008]
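
One possible reading of the `[1, 6008]` in this message, offered as an assumption and not as a confirmed diagnosis: `X = df[['sentences']]` is a one-column DataFrame, and iterating over a DataFrame yields its column labels instead of its rows. A text vectorizer that iterates over its input would then see a single "document". The toy frame below is invented for illustration (it is not the real CSV) and only shows how `df[['sentences']]` and `df['sentences']` differ when iterated:

```python
import pandas as pd

# Toy data, invented for illustration (not the real CSV).
toy = pd.DataFrame({
    "sentences": ["hair loss", "detoxing now", "cut my dosage"],
    "ADR": [1.0, None, None],
})

X_frame = toy[["sentences"]]   # one-column DataFrame, shape (3, 1)
X_series = toy["sentences"]    # Series, shape (3,)

# Iterating a DataFrame walks over column labels, not rows.
docs_from_frame = list(X_frame)
# Iterating a Series walks over the values, one per row.
docs_from_series = list(X_series)

print(len(docs_from_frame), docs_from_frame)   # 1 ['sentences']
print(len(docs_from_series))                   # 3
```

If this is what happens here, selecting the column as a Series (`df['sentences']`) would give one document per row.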

Update: using iloc, my k-fold cross-validation code looks like this:

from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    # Select the rows of each fold by position
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = (y_train_text.iloc[train_index],
                       y_train_text.iloc[test_index])

Would you please let me know which part I'm doing incorrectly?

My data looks like this:

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1.0,,,,,,
1,I am detoxing from Lexapro now.,,,,,,,1.0
2,I slowly cut my dosage over several months and took vitamin supplements to help.,,,,,,,1.0
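
For data shaped like the sample above, here is a minimal sketch of how a K-fold loop could be wired to the same kind of pipeline. It rests on several assumptions: the rows and the `labels` column are invented stand-ins for the real CSV, `n_splits=4` is chosen only because the toy set is tiny, and the labels are binarized once before splitting so that every fold has the same label columns. It is one way to structure the loop, not necessarily the fix for the error.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the CSV: invented rows, with labels already collected per row.
toy = pd.DataFrame({
    "sentences": [
        "extreme weight gain and hair loss",
        "I am detoxing from the drug now",
        "I slowly cut my dosage over several months",
        "severe headache after the first dose",
        "withdrawal symptoms when I stopped",
        "it worked well for my anxiety",
        "weight gain and memory loss",
        "stopping caused dizziness and nausea",
        "my doctor changed the prescription",
        "hair loss after two weeks",
        "tapering off was very hard",
        "I took vitamin supplements to help",
    ],
    "labels": [
        ["ADR"], ["WD"], ["others"], ["ADR"], ["WD"], ["others"],
        ["ADR"], ["ADR", "WD"], ["others"], ["ADR"], ["WD"], ["others"],
    ],
})

X = toy["sentences"]                  # Series: one document per row
lb = MultiLabelBinarizer()
Y = lb.fit_transform(toy["labels"])   # array of shape (n_rows, n_labels)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    # X is a pandas Series (positional .iloc); Y is a NumPy array (plain indexing).
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]

    # Build a fresh pipeline per fold so nothing carries over between folds.
    classifier = make_pipeline(
        CountVectorizer(),
        TfidfTransformer(),
        OneVsRestClassifier(SGDClassifier(random_state=0)),
    )
    classifier.fit(X_train, Y_train)
    predicted = classifier.predict(X_test)
    scores.append(f1_score(Y_test, predicted, average="weighted", zero_division=0))

print("weighted F1 per fold:", scores)
print("mean weighted F1:", sum(scores) / len(scores))
```

Binarizing before the split keeps `Y` aligned row by row with `X`, so the same index arrays from `kf.split` can be used for both.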

It's the example code from sklearn; the problem doesn't come from your KFold but from the data. Please consider not using the X_train name on both sides, otherwise you are updating the same dataset over and over, and that can't be good. - Frayal, Aug 23 '18 at 15:15
@Alexis thanks for following up. I changed it, but it still raises the error. I have updated my question here. Thanks :) - sariii, Aug 23 '18 at 15:21
Yes, I only ask because I recently investigated a phishing email that had nearly identical content to your "data" above... lol ;) - Any Moose, Aug 23 '18 at 15:26
@AnyMoose that's very weird, btw. I'm too busy to distract anyone, and it's not special data: these are comments from a website, and you can find them too. - sariii, Aug 23 '18 at 15:30

0 Answers