
I want to print the samples that the classifier has labeled incorrectly.

I found this code in the question "Sklearn SVM - how to get a list of the wrong predictions?":

for idx, input, prediction, label in zip(enumerate(X_test), X_test, predicted, y_test):
    print("No.", idx[0], 'input,',input, ', has been classified as', prediction, 'and should be', label) 

I get this TypeError: 'numpy.int64' object is not iterable
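For reference, a TypeError of this kind can be triggered without any model: zip() fails as soon as one of its arguments is a single NumPy scalar instead of an array. The sketch below uses made-up toy values (not the real data), and the exact wording of the message may differ between Python versions.

```python
import numpy as np

# Toy stand-ins for the real data (hypothetical values).
X_test = np.array([[0.0, 0.1], [0.2, 0.0], [0.0, 0.3]])
y_test = [0, 1, 0]

# A single scalar where one prediction per row is expected.
scalar_prediction = np.int64(1)

try:
    list(zip(X_test, scalar_prediction, y_test))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print("TypeError:", error_message)

# With one prediction per row, the same zip() call works.
array_prediction = np.array([0, 0, 0])
rows = list(zip(X_test, array_prediction, y_test))
print(len(rows), "rows zipped")
```

If this matches what happens in the real script, checking type(predicted) right after model.predict(...) should show whether it is an array or a scalar.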

My data consists of text data (emails) read from folders and converted to numeric TF-IDF features. About 250 emails have been misclassified, and I want to list them so that I can take a closer look at those files.

Please help me to find a way to list these misclassifications.
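One possible way to do this, sketched under assumptions: compare the predictions with the true labels element-wise and keep the positions where they differ. The name raw_emails_test is hypothetical; it stands for a list holding the original text of each test email in the same order as the rows of X_test. The labels and predictions are toy values.

```python
import numpy as np

# Hypothetical stand-ins: raw_emails_test holds the original text of each
# test email, in the same order as the rows of X_test.
raw_emails_test = ["email a", "email b", "email c", "email d"]
y_test = [0, 1, 0, 1]
predicted = np.array([0, 0, 1, 1])

# Positions where the predicted label differs from the true label.
wrong_idx = np.flatnonzero(predicted != np.asarray(y_test))

for i in wrong_idx:
    print(f"No. {i}: classified as {predicted[i]}, should be {y_test[i]}")
    print(raw_emails_test[i][:200])  # first 200 characters of the text
```

Printing the raw text rather than the TF-IDF row is a deliberate choice here, since a vector with more than 100,000 entries is hard to inspect by eye.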

The data consists of more than 4000 emails like this:

Email[X_test]: messageid 14149441075861143483javamailevansthyme date thu 13 dec 2001 051749 0800 pst from staylorsdecom to teblokeyenroncom subject flight mimeversion 10 contenttype textplain charsetusascii contenttransferencoding 7bit xfrom taylor sandy staylorsdecom xto teblokeyenroncom xcc xbcc xfolder teblokeymar2002lokey tebinbox xorigin lokeyt xfilename tlokey nonprivilegedpst ive made a tentative reservation on continental to leave thursday dec 20 at 550 pm stop in cleveland no change and arrive in houston at 1033 pm return first class on sunday dec 30 at 1050 am change in cleveland and arrive manchester at 503 pm how about making a reservation to fly back with me you can always cancel and return whenever it cheers me just to know that youd even consider this you need the break and i could use the company i know youd love deb friendtenant and she you and ditto for gracie let me know your thoughts love sandy

And after it is transformed with TfidfVectorizer() and todense(), the email looks like this.

X_test[example]: [[0. 0. 0.03120722 ... 0. 0. 0. ]]

The values are the tf-idf weights of the terms.

type of X_test: <class 'numpy.matrix'> (4519, 115674)

4519: number of emails within X_test

115674: number of features (unique terms)
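To illustrate where these types and shapes come from, here is a small sketch with three made-up documents (not the real emails). TfidfVectorizer returns a SciPy sparse matrix; todense() turns that into a numpy.matrix, while toarray() gives a plain numpy.ndarray. scikit-learn estimators such as LogisticRegression also accept the sparse matrix directly, which avoids allocating a dense 4519 x 115674 array.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents standing in for the emails (hypothetical content).
docs = [
    "flight reservation houston",
    "reservation cancel",
    "love sandy houston",
]

vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(docs)  # SciPy sparse matrix
X_array = X_sparse.toarray()               # plain numpy.ndarray

# Shape is (number of documents, number of unique terms).
print(X_sparse.shape)
print(type(X_array))
```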

The emails are labeled as phish (1) or legit (0).

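from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Fit model to data
model = LogisticRegression()
model.fit(X_train, y_train)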

# make predictions
expected = y_test
predicted = model.predict(X_test)
proba = model.predict_proba(X_test)

# Scores
accuracy = accuracy_score(expected, predicted)
recall = recall_score(expected, predicted, average="binary")
precision = precision_score(expected, predicted , average="binary")
f1 = f1_score(expected, predicted , average="binary")

# Confusion matrix
cm = metrics.confusion_matrix(expected, predicted)
print(cm)
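The off-diagonal cells of the confusion matrix count exactly the samples being asked about, so the same comparison that builds the matrix can also yield their indices. A minimal sketch with toy labels (the variable names false_phish and missed_phish are made up for illustration); in scikit-learn's convention rows are true labels and columns are predicted labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = legit, 1 = phish (hypothetical values).
expected = np.array([0, 0, 1, 1, 0, 1])
predicted = np.array([0, 1, 1, 0, 0, 1])

cm = confusion_matrix(expected, predicted)

# cm[0, 1]: legit emails flagged as phish; cm[1, 0]: phish emails passed as legit.
false_phish = np.flatnonzero((expected == 0) & (predicted == 1))
missed_phish = np.flatnonzero((expected == 1) & (predicted == 0))

print(cm)
print("legit classified as phish:", false_phish)
print("phish classified as legit:", missed_phish)
```

Splitting the indices by error type makes it possible to review false alarms and missed phishing emails separately.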

This is the point where I want to list the misclassified samples from X_test.


JFFO
  • just put a "if not prediction == label: print(..)" statement into your loop – some_name.py Apr 14 '21 at 09:36
  • Thank you for the fast reply @some_name But I still get TypeError: 'numpy.int64' object is not iterable? – JFFO Apr 14 '21 at 09:45
  • can you create a full example with some artificial test data to reproduce the error? – some_name.py Apr 14 '21 at 09:54
  • The data is listed like this from TfidfVectorizer() and todense(): [[0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] ... [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.]]. Number of documents in train data: 4585. Samples per class (train): [4082 503]. Number of documents in test data: 4519. Samples per class (test): [4022 497]. Labelled as [legit, phish] and fed into a classification model. – JFFO Apr 14 '21 at 10:00
  • yes but when I make up some artificial data to run the code it works... so from the information given above I probably cant help... so maybe you can edit your question and make an example which we could run to reproduce the error. Just set up some X_test with just a couple of examples and copy it to serve as the prediction. – some_name.py Apr 14 '21 at 10:04
  • I have updated the question, due to the large text files, it is difficult to upload many examples. – JFFO Apr 14 '21 at 10:27
  • can you show us the output of "type(X_test)" ,"type(predicted)" and "type(y_test)" – some_name.py Apr 14 '21 at 11:16
  • "type(X_test)" gives output "numpy.matrix", "type(predicted)" gives output "numpy.int64", and "type(y_test)" gives output "list" – JFFO Apr 14 '21 at 12:48
  • there you go... predicted should also be a list or a numpy array. Seems that it is only one number so you cant iterate over it. Make sure all those variables have the same length in the first place. – some_name.py Apr 14 '21 at 13:13
  • My mistake, it works when using Logistic Regression, however when I use SVM linear, the predicted becomes numpy.int64, do you know how to answer my question when the classifier is SVM linear? – JFFO Apr 14 '21 at 13:43
  • seems very strange... model.predict(X_test) should always have the same length as X_test... You should really be careful (and understand) what happens at this point in the code and maybe use some "assert len(prediction) == len(X_test)" to make sure you are not running into problems later on. – some_name.py Apr 14 '21 at 13:53
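
The length check suggested in the comments could be wrapped in a small guard that runs right after predict(). This is only a sketch: check_predictions is a hypothetical helper, not part of scikit-learn, and the arrays are toy values.

```python
import numpy as np

def check_predictions(X, y_true, y_pred):
    """Hypothetical helper: fail early if predictions are not row-aligned with X."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim != 1:
        raise ValueError(f"expected a 1-D array of predictions, got ndim={y_pred.ndim}")
    n_rows = X.shape[0]
    if not (len(y_true) == len(y_pred) == n_rows):
        raise ValueError(
            f"length mismatch: X has {n_rows} rows, "
            f"y_true has {len(y_true)}, y_pred has {len(y_pred)}"
        )
    return y_pred

X_test = np.zeros((3, 2))
y_test = [0, 1, 0]

# One prediction per row passes the check.
ok = check_predictions(X_test, y_test, np.array([0, 1, 1]))

# A lone scalar, as reported for the SVM run, is rejected with a clear message.
try:
    check_predictions(X_test, y_test, np.int64(1))
    scalar_error = None
except ValueError as exc:
    scalar_error = str(exc)

print(ok, scalar_error)
```

A check like this does not explain why predicted ended up as a scalar, but it moves the failure to the line where the problem first appears.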

0 Answers