I would like to do K-fold cross-validation. The code before adding K-fold cross-validation looks like this, and it works perfectly:

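import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv('finalupdatedothers-multilabel.csv')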

X= df[['sentences']]

dfy = df[['ADR','WD','EF','INF','SSI','DI','others']]
df1 = dfy.stack().reset_index()
df1.columns = ['a','b','c']
y_train_text = df1.groupby('a')['b'].apply(list)

lb = preprocessing.MultiLabelBinarizer()
# Run classifier
stop_words = stopwords.words('english')

classifier=make_pipeline(CountVectorizer(),
                  TfidfTransformer(),
                  #SelectKBest(chi2, k=4),
                  OneVsRestClassifier(SGDClassifier()))

#combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

random_state = np.random.RandomState(0)
# Split into training and test
# X (defined above) is the input here; X_train does not exist before this call
X_train, X_test, y_train, y_test = train_test_split(X, y_train_text, test_size=.2,
                                                    random_state=random_state)
print(y_train)
# # Binarize the output classes
Y = lb.fit_transform(y_train)
Y_test=lb.transform(y_test)
# Fit once, then reuse the fitted pipeline for scores and predictions
classifier.fit(X_train, Y)
y_score = classifier.decision_function(X_test)
print ("y_score"+str(y_score))
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)

#print accuracy_score
print ("accuracy : "+str(accuracy_score(Y_test, predicted)))

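# average='weighted' is used below, so the labels say "weighted" rather than "micro"
print ("weighted f-measure : "+str(f1_score(Y_test, predicted, average='weighted')))

print("weighted precision : "+str(precision_score(Y_test,predicted,average='weighted')))

print("weighted recall : "+str(recall_score(Y_test,predicted,average='weighted')))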

for item, labels in zip(X_test, all_labels):
    print ('%s => %s' % (item, ', '.join(labels)))

When I change the code to use K-fold cross-validation instead of train_test_split, I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 6008]
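
One plausible reading of the `[1, 6008]` in this message, offered as a guess rather than a confirmed diagnosis: `X = df[['sentences']]` is a one-column DataFrame, and iterating over a DataFrame yields its column labels, not its rows. A text vectorizer that iterates over its input would then see a single "document" (the string `'sentences'`), while the label array has one row per sample. The toy frame below is made up for illustration; only the `sentences` column name comes from the question.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real CSV
df = pd.DataFrame({"sentences": ["hair loss", "detoxing now", "cut my dosage"]})

X_frame = df[["sentences"]]   # double brackets: one-column DataFrame
X_series = df["sentences"]    # single brackets: Series of strings

# Iterating a DataFrame walks its column labels
docs_from_frame = list(X_frame)
# Iterating a Series walks its values
docs_from_series = list(X_series)

print(len(docs_from_frame), len(docs_from_series))
```

If this is indeed the cause, selecting the column as a Series (`df['sentences']`) would give the vectorizer one document per row.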

Updated to use iloc, my K-fold cross-validation code looks like this:

kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    # Parentheses let the assignment continue onto the next line
    y_train, y_test = (y_train_text.iloc[train_index],
                       y_train_text.iloc[test_index])
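
For comparison, here is a minimal sketch of a complete K-fold loop around the same kind of pipeline. It is not the asker's exact setup: the six sentences, the `labels` column, `n_splits=3`, `random_state=0` and `zero_division=0` are all assumptions chosen so the example runs on its own. The sketch passes the text as a Series, binarizes the labels once before splitting, and fits a fresh pipeline in every fold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data; the real data comes from the CSV in the question
df = pd.DataFrame({
    "sentences": [
        "extreme weight gain and hair loss",
        "detoxing from the drug now",
        "memory loss and weight gain",
        "slowly cut the dosage over months",
        "hair loss after two weeks",
        "took vitamin supplements to help",
    ],
    "labels": [["ADR"], ["others"], ["ADR"], ["others"], ["ADR"], ["others"]],
})

X = df["sentences"]                                 # Series of strings
Y = MultiLabelBinarizer().fit_transform(df["labels"])  # shape (n_samples, n_classes)

kf = KFold(n_splits=3)
fold_scores = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]

    # A new pipeline per fold, so nothing learned on one fold leaks into the next
    clf = make_pipeline(CountVectorizer(),
                        TfidfTransformer(),
                        OneVsRestClassifier(SGDClassifier(random_state=0)))
    clf.fit(X_train, Y_train)
    predicted = clf.predict(X_test)
    fold_scores.append(f1_score(Y_test, predicted,
                                average="weighted", zero_division=0))

print("mean weighted F1 over folds:", np.mean(fold_scores))
```

Binarizing once up front keeps the label columns identical across folds; whether that suits the real data depends on every label being known in advance.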

Could you please let me know which part I'm doing incorrectly?

My data looks like this:

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1.0,,,,,,
1,I am detoxing from Lexapro now.,,,,,,,1.0
2,I slowly cut my dosage over several months and took vitamin supplements to help.,,,,,,,1.0
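
To make the label construction concrete, the sketch below runs the `stack` / `groupby` steps on the three sample rows above. The explicit `.dropna()` is an addition, made under the assumption that the empty label cells should be discarded regardless of pandas version; everything else mirrors the question's code. Note that a row with no labels at all would vanish from the grouped result, which would leave fewer label lists than sentences.

```python
import io

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

csv_text = ''',sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1.0,,,,,,
1,I am detoxing from Lexapro now.,,,,,,,1.0
2,I slowly cut my dosage over several months and took vitamin supplements to help.,,,,,,,1.0
'''
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

label_cols = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Long format: one (row, label, value) triple per non-empty cell
stacked = df[label_cols].stack().dropna().reset_index()
stacked.columns = ['a', 'b', 'c']

# One list of label names per original row
y_text = stacked.groupby('a')['b'].apply(list)

lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_text)

print(y_text.tolist())
print(list(lb.classes_))
print(Y.tolist())
```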
sariii
  • It's the example code from sklearn; the problem doesn't come from your KFold but from the data. Please consider not using the X_train name on both sides, or else you are updating the same dataset over and over, and that can't be good – Frayal Aug 23 '18 at 15:15
  • Are you the person that sent me that junk email... – Any Moose Aug 23 '18 at 15:21
  • @Alexis thanks for following up. I changed it but it still raises the error. I have updated my question here. Thanks :) – sariii Aug 23 '18 at 15:21
  • @AnyMoose are you ok? – sariii Aug 23 '18 at 15:22
  • Yes, I only ask because I recently investigated a phishing email that had nearly identical content as your "data" above... lol ;) – Any Moose Aug 23 '18 at 15:26
  • @AnyMoose that's too weird, by the way. I'm too busy to distract someone, and it's not special data; they are comments on a website, and you can find them too. – sariii Aug 23 '18 at 15:30

0 Answers