I want to print the samples that my classifier has labeled incorrectly.
I found this code in "Sklearn SVM - how to get a list of the wrong predictions?":
for idx, input, prediction, label in zip(enumerate(X_test), X_test, predicted, y_test):
    print("No.", idx[0], 'input,',input, ', has been classified as', prediction, 'and should be', label)
I get this TypeError: 'numpy.int64' object is not iterable
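For comparison, the same loop can be written with a single enumerate over the zipped sequences, keeping only the rows where the prediction and the label differ. This is a minimal sketch that uses small made-up lists in place of predicted and y_test. It does not diagnose the TypeError above, and the helper name list_misclassified is hypothetical.

```python
def list_misclassified(predicted, expected):
    """Return (index, prediction, label) for every misclassified sample."""
    return [
        (idx, int(pred), int(label))
        for idx, (pred, label) in enumerate(zip(predicted, expected))
        if pred != label
    ]


# Made-up stand-ins for `predicted` and `y_test` (1 = phish, 0 = legit).
toy_predicted = [0, 1, 1, 0, 1]
toy_expected = [0, 0, 1, 1, 1]

wrong = list_misclassified(toy_predicted, toy_expected)
for idx, pred, label in wrong:
    print("No.", idx, "has been classified as", pred, "and should be", label)
```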
My data consists of text (emails) read from folders and converted to numeric features with TF-IDF. About 250 files have been misclassified, and I want to list them so that I can take a closer look at them.
Please help me to find a way to list these misclassifications.
The data consists of more than 4000 emails like this:
Email[X_test]: messageid 14149441075861143483javamailevansthyme date thu 13 dec 2001 051749 0800 pst from staylorsdecom to teblokeyenroncom subject flight mimeversion 10 contenttype textplain charsetusascii contenttransferencoding 7bit xfrom taylor sandy staylorsdecom xto teblokeyenroncom xcc xbcc xfolder teblokeymar2002lokey tebinbox xorigin lokeyt xfilename tlokey nonprivilegedpst ive made a tentative reservation on continental to leave thursday dec 20 at 550 pm stop in cleveland no change and arrive in houston at 1033 pm return first class on sunday dec 30 at 1050 am change in cleveland and arrive manchester at 503 pm how about making a reservation to fly back with me you can always cancel and return whenever it cheers me just to know that youd even consider this you need the break and i could use the company i know youd love deb friendtenant and she you and ditto for gracie let me know your thoughts love sandy
After it is transformed with TfidfVectorizer() and todense(), the email looks like this:
X_test[example]: [[0. 0. 0.03120722 ... 0. 0. 0. ]]
The values represent the tf-idf weights.
type of X_test: <class 'numpy.matrix'> (4519, 115674)
4519: number of emails within X_test
115674: number of features (unique terms)
The emails are labeled as phish (1) or legit (0).
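from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Fit model to data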
model = LogisticRegression()
model.fit(X_train, y_train)
# make predictions
expected = y_test
predicted = model.predict(X_test)
proba = model.predict_proba(X_test)
# Scores
accuracy = accuracy_score(expected, predicted)
recall = recall_score(expected, predicted, average="binary")
precision = precision_score(expected, predicted , average="binary")
f1 = f1_score(expected, predicted , average="binary")
# Confusion matrix
cm = metrics.confusion_matrix(expected, predicted)
print(cm)
This is the point at which I want to list the misclassified emails from X_test.
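Because X_test holds only tf-idf weights, reading the misclassified emails requires the raw text kept in the same order as the rows of X_test. Here is a minimal sketch, assuming a hypothetical list raw_test_emails aligned with y_test (it is not part of the code above). It separates the errors into the two off-diagonal cells of the confusion matrix, using made-up data and a hypothetical helper named split_errors.

```python
def split_errors(raw_emails, predicted, expected):
    """Group misclassified emails into false positives and false negatives.

    Assumes labels are 1 = phish and 0 = legit, and that all three
    sequences are in the same order.
    """
    false_positives = []  # legit emails predicted as phish
    false_negatives = []  # phish emails predicted as legit
    for idx, (text, pred, label) in enumerate(zip(raw_emails, predicted, expected)):
        if pred == label:
            continue  # correctly classified, nothing to report
        if pred == 1:
            false_positives.append((idx, text))
        else:
            false_negatives.append((idx, text))
    return false_positives, false_negatives


# Hypothetical stand-ins for the raw test emails, `predicted`, and `y_test`.
raw_test_emails = [
    "flight reservation",
    "verify your account",
    "meeting notes",
    "you won a prize",
]
toy_predicted = [1, 1, 0, 0]
toy_expected = [0, 1, 0, 1]

fp, fn = split_errors(raw_test_emails, toy_predicted, toy_expected)
print("False positives:", fp)
print("False negatives:", fn)
```

With the real data, the same function could be called with the output of model.predict(X_test) and y_test. It only iterates over the sequences and compares elements, so it should also accept NumPy arrays.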