
I'm computing the confusion matrix for a Facebook FastText classifier model in this way:

#!/usr/local/bin/python3

import argparse
import numpy as np
from sklearn.metrics import confusion_matrix


def parse_labels(path):
    with open(path, 'r') as f:
        return np.array(list(map(lambda x: int(x[9:]), f.read().split())))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Display confusion matrix.')
    parser.add_argument('test', help='Path to test labels')
    parser.add_argument('predict', help='Path to predictions')
    args = parser.parse_args()
    test_labels = parse_labels(args.test)
    pred_labels = parse_labels(args.predict)

    print(test_labels)
    print(pred_labels)

    eq = test_labels == pred_labels
    print("Accuracy: " + str(eq.sum() / len(test_labels)))
    print(confusion_matrix(test_labels, pred_labels))

My predictions and test set look like this:

$ head -n10 /root/pexp 
__label__spam
__label__verified
__label__verified
__label__spam
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified

$ head -n10 /root/dataset_test.csv 
__label__spam
__label__verified
__label__verified
__label__spam
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified

The model's predictions were computed over the test set in this way:

./fasttext predict /root/my_model.bin /root/dataset_test.csv > /root/pexp
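
As a side note, if you use the fastText Python bindings instead of the CLI, the same prediction step can be sketched like this (a minimal sketch; it assumes the official `fasttext` Python module is installed and uses the same paths as above):

import fasttext

# load the trained supervised model (same path as in the CLI example)
model = fasttext.load_model('/root/my_model.bin')

# write one predicted label per test line, mirroring `./fasttext predict`
with open('/root/dataset_test.csv') as test, open('/root/pexp', 'w') as out:
    for line in test:
        # predict() returns (labels, probabilities); keep the top label
        labels, _ = model.predict(line.rstrip('\n'))
        out.write(labels[0] + '\n')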

I then calculate the FastText Confusion Matrix:

$ ./confusion.py /root/dataset_test.csv /root/pexp

but I'm stuck with this error:

Traceback (most recent call last):
  File "./confusion.py", line 18, in <module>
    test_labels = parse_labels(args.test)
  File "./confusion.py", line 10, in parse_labels
    return np.array(list(map(lambda x: int(x[9:]), f.read().split())))
  File "./confusion.py", line 10, in <lambda>
    return np.array(list(map(lambda x: int(x[9:]), f.read().split())))
ValueError: invalid literal for int() with base 10: 'spam'

I have fixed the script as suggested to handle non-numeric labels:

def parse_labels(path):
    with open(path, 'r') as f:
        # keep the labels as strings; just strip the 9-char __label__ prefix
        return np.array(list(map(lambda x: x[9:], f.read().split())))
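
With string labels there is nothing left to convert: sklearn's confusion_matrix accepts string class labels directly. As a quick standalone sanity check (toy labels here, not my real data):

import numpy as np
from sklearn.metrics import confusion_matrix

# toy string labels, just to show confusion_matrix handles them natively
test_labels = np.array(['spam', 'verified', 'verified', 'spam'])
pred_labels = np.array(['spam', 'verified', 'spam', 'spam'])

print(confusion_matrix(test_labels, pred_labels))
# [[2 0]
#  [1 1]]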

Also, in the case of FastText it's possible that at some point the test set will have normalized labels (without the `__label__` prefix), so to add the prefix back you can do something like:

awk 'BEGIN{FS=OFS="\t"}{ $1 = "__label__" tolower($1) }1' /root/dataset_test.csv  > /root/dataset_test_norm.csv 


Also, the input test file must be cut down to the label column, dropping all other columns:

cut -f 1 -d$'\t' /root/dataset_test_norm.csv > /root/dataset_test_norm_label.csv
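
The same two steps (restoring the prefix and keeping only the label column) can also be done in a single Python pass; a minimal sketch, assuming the same input and output paths as the awk and cut commands above:

# one-pass equivalent of the awk + cut pipeline above
with open('/root/dataset_test.csv') as src, \
        open('/root/dataset_test_norm_label.csv', 'w') as dst:
    for line in src:
        # the first tab-separated field is the label: lowercase it
        # and restore the __label__ prefix
        label = line.split('\t')[0].strip().lower()
        dst.write('__label__' + label + '\n')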

So finally we get the Confusion Matrix:

$ ./confusion.py /root/dataset_test_norm_label.csv /root/pexp
Accuracy: 0.998852852227
[[9432    21]
 [    3 14543]]

[UPDATE]

The script is now working fine, and I have added the confusion matrix calculation script directly to my FastText Node.js implementation, FastText.js: https://github.com/loretoparisi/fasttext.js#confusion-matrix

loretoparisi
  • Your script is wrong: it expects numbers in the given input file (look at the parse_labels method), whereas you have text labels. – unautre Oct 30 '17 at 17:51
  • uhm, so you are referring to the `return np.array(list(map(lambda x: int(x[9:]), f.read().split())))` line that is parsing the labels... – loretoparisi Oct 30 '17 at 17:58
  • Exactly. If I understand correctly, that line expects everything after the ninth character of the line to form an integer number; and that's not what your data looks like, at all. – unautre Oct 30 '17 at 18:32
  • yes, since `fasttext` by default adds the `__label__` prefix, which is exactly 9 chars. So my question is why the label is parsed as a number and not a string there... because the label should normally be a string... – loretoparisi Oct 30 '17 at 18:34
  • There is no need to remove `__label__` at all. The comparison in `eq = test_labels == pred_labels` does indeed compare strings here. So you can improve `parse_labels()` a bit: `return np.array(f.readlines())` – sgelb Dec 31 '18 at 12:00
  • @sgelb thank you! I have recently updated the script, fixing some graph issues. If you submit a PR I would merge it. Here: https://github.com/loretoparisi/fasttext.js#confusion-matrix – loretoparisi Jan 01 '19 at 20:09

1 Answer

from sklearn.metrics import confusion_matrix

# predict each text; model.predict() returns (labels, probabilities),
# so [0][0] takes the top predicted label string
df["predicted"] = df["text"].apply(lambda x: model.predict(x)[0][0])

# Create the confusion matrix
confusion_matrix(df["labeled"], df["predicted"])


## Output:
# array([[5823,    8,  155,    1],
#        [ 199,   51,   22,    0],
#        [ 561,    2,  764,    0],
#        [  48,    0,    4,    4]], dtype=int64)
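
For this snippet to run, a trained model and a DataFrame with "text" and "labeled" columns are assumed; a minimal sketch of that setup (paths, separator, and column names are placeholders, not from the answer):

import fasttext
import pandas as pd

# placeholder paths: a trained supervised model and a tab-separated test set
model = fasttext.load_model('my_model.bin')
df = pd.read_csv('dataset_test.tsv', sep='\t', names=['labeled', 'text'])

Note that the "labeled" column must contain the same `__label__...` strings that predict() returns, otherwise the true and predicted classes will not line up in the matrix.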
Ramkrishan Sahu