scikit learn output metrics.classification_report into CSV/tab-delimited format

Question

I'm doing a multiclass text classification in Scikit-Learn. The dataset is being trained using the Multinomial Naive Bayes classifier having hundreds of labels. Here's an extract from the Scikit Learn script for fitting the MNB model

from __future__ import print_function

# Read **`file.csv`** into a pandas DataFrame

import pandas as pd
path = 'data/file.csv'
merged = pd.read_csv(path, error_bad_lines=False, low_memory=False)

# define X and y using the original DataFrame
X = merged.text
y = merged.grid

# split X and y into training and testing sets;
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)

# make class predictions
y_pred_class = nb.predict(X_test_dtm)

# generate classification report
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred_class))

And a simplified output of the metrics.classification_report on command line screen looks like this:

             precision  recall   f1-score   support
     12       0.84      0.48      0.61      2843
     13       0.00      0.00      0.00        69
     15       1.00      0.19      0.32       232
     16       0.75      0.02      0.05       965
     33       1.00      0.04      0.07       155
      4       0.59      0.34      0.43      5600
     41       0.63      0.49      0.55      6218
     42       0.00      0.00      0.00       102
     49       0.00      0.00      0.00        11
      5       0.90      0.06      0.12      2010
     50       0.00      0.00      0.00         5
     51       0.96      0.07      0.13      1267
     58       1.00      0.01      0.02       180
     59       0.37      0.80      0.51      8127
      7       0.91      0.05      0.10       579
      8       0.50      0.56      0.53      7555      
    avg/total 0.59      0.48      0.45     35919

I was wondering if there was any way to get the report output into a standard csv file with regular column headers

When I send the command line output into a csv file or try to copy/paste the screen output into a spreadsheet - Openoffice Calc or Excel, It lumps the results in one column. Looking like this:

I'll be trying to recreate the results as I type this, But have u tried turning the table into a DataFrame using Pandas and then sending the dataframe to csv using `dataframe_name_here.to_csv()` ? Could you also show the code in which you write the results to the csv? — MattR, Sep 23 '16 at 13:58
@MattR I have edited the question and provided the full python code...I was passing the output of the script to a CSV file from Linux command line thus: $ python3 script.py > result.csv — Seun AJAO, Sep 23 '16 at 15:14

score 123 · Answer 1 · edited Apr 09 '21 at 11:52

123

As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas Dataframe is by simply having the report returned as a dict:

report = classification_report(y_test, y_pred, output_dict=True)

and then construct a Dataframe and transpose it:

df = pandas.DataFrame(report).transpose()

From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).

See the documentation.

edited Apr 09 '21 at 11:52

desertnaut

57,590
26
140
166

answered Dec 14 '18 at 13:20

janus235

1,421
1
10
12

8

df.to_csv('file_name.csv') for the lazy :) – Prashant Saraswat Nov 11 '20 at 16:26
1

Perfect answer. Minor note: since the output dict accuracy has only one value, it will be repeated in the accuracy row of your dataframe. If you want your export to mirror the sklearn output exactly, you can use the snippet below. `report.update({"accuracy": {"precision": None, "recall": None, "f1-score": report["accuracy"], "support": report['macro avg']['support']}})` – piedpiper Feb 25 '22 at 23:52
This answer should be accepted! – Say OL Jul 15 '23 at 06:58

score 21 · Answer 2 · edited Jun 10 '18 at 19:19

21

If you want the individual scores this should do the job just fine.

import pandas as pd

def classification_report_csv(report):
    report_data = []
    lines = report.split('\n')
    for line in lines[2:-3]:
        row = {}
        row_data = line.split('      ')
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    dataframe = pd.DataFrame.from_dict(report_data)
    dataframe.to_csv('classification_report.csv', index = False)

report = classification_report(y_true, y_pred)
classification_report_csv(report)

edited Jun 10 '18 at 19:19

Vlad Călin Buzea

559
5
19

answered Dec 08 '16 at 16:33

kindjacket

1,410
2
15
23

4

row['precision'] = float(row_data[1]) ValueError: could not convert string to float: – user3806649 Dec 29 '17 at 22:17
3

change line ```row_data = line.split(' ')``` by ```row_data = line.split(' ') row_data = list(filter(None, row_data))``` – RomaneG Jul 20 '18 at 15:27
Really cool ,and thanks~ And I make a comment for the split statement: row_data = line.split(' ') , this one should be better like this : row_data = line.split(), because some time the space number in the report string is not equal – Ting Jia Mar 15 '19 at 08:13
Better to replace ```row_data = line.split(' ')``` with ```row_data = ' '.join(line.split()) row_data = row_data.split(' ')``` to account for irregular spaces. – Satheesh K Dec 18 '19 at 13:41

score 14 · Answer 3 · edited Apr 09 '21 at 11:53

Just import pandas as pd and make sure that you set the output_dict parameter which by default is False to True when computing the classification_report. This will result in an classification_report dictionary which you can then pass to a pandas DataFrame method. You may want to transpose the resulting DataFrame to fit the fit the output format that you want. The resulting DataFrame may then be written to a csv file as you wish.

clsf_report = pd.DataFrame(classification_report(y_true = your_y_true, y_pred = your_y_preds5, output_dict=True)).transpose()
clsf_report.to_csv('Your Classification Report Name.csv', index= True)

score 10 · Answer 4 · edited Apr 09 '21 at 11:53

We can get the actual values from the precision_recall_fscore_support function and then put them into data frames. the below code will give the same result, but now in a pandas dataframe:

clf_rep = metrics.precision_recall_fscore_support(true, pred)
out_dict = {
             "precision" :clf_rep[0].round(2)
            ,"recall" : clf_rep[1].round(2)
            ,"f1-score" : clf_rep[2].round(2)
            ,"support" : clf_rep[3]
            }
out_df = pd.DataFrame(out_dict, index = nb.classes_)
avg_tot = (out_df.apply(lambda x: round(x.mean(), 2) if x.name!="support" else  round(x.sum(), 2)).to_frame().T)
avg_tot.index = ["avg/total"]
out_df = out_df.append(avg_tot)
print out_df

Kam Sen · Answer 5 · 2017-09-27T13:00:48.397

While the previous answers are probably all working I found them a bit verbose. The following stores the individual class results as well as the summary line in a single dataframe. Not very sensitive to changes in the report but did the trick for me.

#init snippet and fake data
from io import StringIO
import re
import pandas as pd
from sklearn import metrics
true_label = [1,1,2,2,3,3]
pred_label = [1,2,2,3,3,1]

def report_to_df(report):
    report = re.sub(r" +", " ", report).replace("avg / total", "avg/total").replace("\n ", "\n")
    report_df = pd.read_csv(StringIO("Classes" + report), sep=' ', index_col=0)        
    return(report_df)

#txt report to df
report = metrics.classification_report(true_label, pred_label)
report_df = report_to_df(report)

#store, print, copy...
print (report_df)

Which gives the desired output:

Classes precision   recall  f1-score    support
1   0.5 0.5 0.5 2
2   0.5 0.5 0.5 2
3   0.5 0.5 0.5 2
avg/total   0.5 0.5 0.5 6

score 6 · Answer 6 · edited Apr 09 '21 at 11:55

It's obviously a better idea to just output the classification report as dict:

sklearn.metrics.classification_report(y_true, y_pred, output_dict=True)

But here's a function I made to convert all classes (only classes) results to a pandas dataframe.

def report_to_df(report):
    report = [x.split(' ') for x in report.split('\n')]
    header = ['Class Name']+[x for x in report[0] if x!='']
    values = []
    for row in report[1:-5]:
        row = [value for value in row if value!='']
        if row!=[]:
            values.append(row)
    df = pd.DataFrame(data = values, columns = header)
    return df

score 4 · Answer 7 · edited Apr 09 '21 at 11:54

As mentioned in one of the posts in here, precision_recall_fscore_support is analogous to classification_report.

Then it suffices to use pandas to easily format the data in a columnar format, similar to what classification_report does. Here is an example:

import numpy as np
import pandas as pd

from sklearn.metrics import classification_report
from  sklearn.metrics import precision_recall_fscore_support

np.random.seed(0)

y_true = np.array([0]*400 + [1]*600)
y_pred = np.random.randint(2, size=1000)

def pandas_classification_report(y_true, y_pred):
    metrics_summary = precision_recall_fscore_support(
            y_true=y_true, 
            y_pred=y_pred)
    
    avg = list(precision_recall_fscore_support(
            y_true=y_true, 
            y_pred=y_pred,
            average='weighted'))

    metrics_sum_index = ['precision', 'recall', 'f1-score', 'support']
    class_report_df = pd.DataFrame(
        list(metrics_summary),
        index=metrics_sum_index)
    
    support = class_report_df.loc['support']
    total = support.sum() 
    avg[-1] = total
    
    class_report_df['avg / total'] = avg

    return class_report_df.T

With classification_report You'll get something like:

print(classification_report(y_true=y_true, y_pred=y_pred, digits=6))

Output:

             precision    recall  f1-score   support

          0   0.379032  0.470000  0.419643       400
          1   0.579365  0.486667  0.528986       600

avg / total   0.499232  0.480000  0.485248      1000

Then with our custom funtion pandas_classification_report:

df_class_report = pandas_classification_report(y_true=y_true, y_pred=y_pred)
print(df_class_report)

Output:

             precision    recall  f1-score  support
0             0.379032  0.470000  0.419643    400.0
1             0.579365  0.486667  0.528986    600.0
avg / total   0.499232  0.480000  0.485248   1000.0

Then just save it to csv format (refer to here for other separator formating like sep=';'):

df_class_report.to_csv('my_csv_file.csv',  sep=',')

I open my_csv_file.csv with LibreOffice Calc (although you could use any tabular/spreadsheet editor like excel):

The averages calculated by classification_report are weighted with the support values. — Flynamic, May 10 '18 at 20:32
So it should be `avg = (class_report_df.loc[metrics_sum_index[:-1]] * class_report_df.loc[metrics_sum_index[-1]]).sum(axis=1) / total` — Flynamic, May 10 '18 at 21:19
Nice catch @Flynamic! I figured it out that `precision_recall_fscore_support` has an `average` param. which does just what you suggest! — Raul, May 11 '18 at 19:09

score 3 · Answer 8 · answered May 02 '18 at 20:22

I also found some of the answers a bit verbose. Here is my three line solution, using precision_recall_fscore_support as others have suggested.

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

report = pd.DataFrame(list(precision_recall_fscore_support(y_true, y_pred)),
            index=['Precision', 'Recall', 'F1-score', 'Support']).T

# Now add the 'Avg/Total' row
report.loc['Avg/Total', :] = precision_recall_fscore_support(y_true, y_test,
    average='weighted')
report.loc['Avg/Total', 'Support'] = report['Support'].sum()

This works, but trying to use the `labels` parameter of `precision_recall_fscore_support` raises, for some reason, `ValueError: y contains previously unseen labels` — Jack Fleeting, Feb 18 '19 at 03:18

score 3 · Answer 9 · answered Feb 04 '20 at 16:19

The simplest and best way I found is:

classes = ['class 1','class 2','class 3']

report = classification_report(Y[test], Y_pred, target_names=classes)

report_path = "report.txt"

text_file = open(report_path, "w")
n = text_file.write(report)
text_file.close()

score 2 · Answer 10 · answered Mar 17 '18 at 09:32

2

Another option is to calculate the underlying data and compose the report on your own. All the statistics you will get by

precision_recall_fscore_support

answered Mar 17 '18 at 09:32

Karel Macek

1,119
2
11
24

Surya · Answer 11 · 2018-05-29T23:00:11.867

Along with example input-output, here's the other function metrics_report_to_df(). Implementing precision_recall_fscore_support from Sklearn metrics should do:

# Generates classification metrics using precision_recall_fscore_support:
from sklearn import metrics
import pandas as pd
import numpy as np; from numpy import random

# Simulating true and predicted labels as test dataset: 
np.random.seed(10)
y_true = np.array([0]*300 + [1]*700)
y_pred = np.random.randint(2, size=1000)

# Here's the custom function returning classification report dataframe:
def metrics_report_to_df(ytrue, ypred):
    precision, recall, fscore, support = metrics.precision_recall_fscore_support(ytrue, ypred)
    classification_report = pd.concat(map(pd.DataFrame, [precision, recall, fscore, support]), axis=1)
    classification_report.columns = ["precision", "recall", "f1-score", "support"] # Add row w "avg/total"
    classification_report.loc['avg/Total', :] = metrics.precision_recall_fscore_support(ytrue, ypred, average='weighted')
    classification_report.loc['avg/Total', 'support'] = classification_report['support'].sum() 
    return(classification_report)

# Provide input as true_label and predicted label (from classifier)
classification_report = metrics_report_to_df(y_true, y_pred)

# Here's the output (metrics report transformed to dataframe )
In [1047]: classification_report
Out[1047]: 
           precision    recall  f1-score  support
0           0.300578  0.520000  0.380952    300.0
1           0.700624  0.481429  0.570703    700.0
avg/Total   0.580610  0.493000  0.513778   1000.0

score 1 · Answer 12 · answered Jan 22 '19 at 05:10

I have modified @kindjacket's answer. Try this:

import collections
def classification_report_df(report):
    report_data = []
    lines = report.split('\n')
    del lines[-5]
    del lines[-1]
    del lines[1]
    for line in lines[1:]:
        row = collections.OrderedDict()
        row_data = line.split()
        row_data = list(filter(None, row_data))
        row['class'] = row_data[0] + " " + row_data[1]
        row['precision'] = float(row_data[2])
        row['recall'] = float(row_data[3])
        row['f1_score'] = float(row_data[4])
        row['support'] = int(row_data[5])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    df.set_index('class', inplace=True)
    return df

You can just export that df to csv using pandas

The line `row['support'] = int(row_data[5])` raises `IndexError: list index out of range` — Jack Fleeting, Feb 18 '19 at 03:00

score 1 · Answer 13 · answered Nov 30 '22 at 09:59

Below function can be used to get the classification report as a pandas dataframe which then can be dumped as a csv file. The resulting csv file will look exactly like when we print the classification report.

import pandas as pd
from sklearn import metrics


def classification_report_df(y_true, y_pred):
    report = metrics.classification_report(y_true, y_pred, output_dict=True)
    df_report = pd.DataFrame(report).transpose()
    df_report.round(3)        
    df_report = df_report.astype({'support': int})    
    df_report.loc['accuracy',['precision','recall','support']] = [None,None,df_report.loc['macro avg']['support']]
    return df_report


report = classification_report_df(y_true, y_pred)
report.to_csv("<Full Path to Save CSV>")

score 0 · Answer 14 · answered Jul 24 '17 at 14:51

def to_table(report):
    report = report.splitlines()
    res = []
    res.append(['']+report[0].split())
    for row in report[2:-2]:
       res.append(row.split())
    lr = report[-1].split()
    res.append([' '.join(lr[:3])]+lr[3:])
    return np.array(res)

returns a numpy array which can be turned to pandas dataframe or just be saved as csv file.

score 0 · Answer 15 · answered Apr 02 '18 at 08:35

This is my code for 2 classes(pos,neg) classification

report = metrics.precision_recall_fscore_support(true_labels,predicted_labels,labels=classes)
        rowDicionary["precision_pos"] = report[0][0]
        rowDicionary["recall_pos"] = report[1][0]
        rowDicionary["f1-score_pos"] = report[2][0]
        rowDicionary["support_pos"] = report[3][0]
        rowDicionary["precision_neg"] = report[0][1]
        rowDicionary["recall_neg"] = report[1][1]
        rowDicionary["f1-score_neg"] = report[2][1]
        rowDicionary["support_neg"] = report[3][1]
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writerow(rowDicionary)

DeshDeep Singh · Answer 16 · 2021-04-14T07:26:15.297

I have written below code to extract the classification report and save it to an excel file:

def classifcation_report_processing(model_to_report):
    tmp = list()
    for row in model_to_report.split("\n"):
        parsed_row = [x for x in row.split("  ") if len(x) > 0]
        if len(parsed_row) > 0:
            tmp.append(parsed_row)

    # Store in dictionary
    measures = tmp[0]

    D_class_data = defaultdict(dict)
    for row in tmp[1:]:
        class_label = row[0]
        for j, m in enumerate(measures):
            D_class_data[class_label][m.strip()] = float(row[j + 1].strip())
    save_report = pd.DataFrame.from_dict(D_class_data).T
    path_to_save = os.getcwd() +'/Classification_report.xlsx'
    save_report.to_excel(path_to_save, index=True)
    return save_report.head(5)

To call the function below line can be used anywhere in the program:

saving_CL_report_naive_bayes = classifcation_report_processing(classification_report(y_val, prediction))

The output looks like below:

score 0 · Answer 17 · answered Mar 15 '19 at 04:08

0

I had the same problem what i did was, paste the string output of metrics.classification_report into google sheets or excel and split the text into columns by custom 5 whitespaces.

answered Mar 15 '19 at 04:08

Nitin Kumar

1

score 0 · Answer 18 · answered May 17 '21 at 11:05

Definitely worth using:

sklearn.metrics.classification_report(y_true, y_pred, output_dict=True)

But a slightly revised version of the function by Yash Nag is as follows. The function includes the accuracy, macro accuracy and weighted accuracy rows along with the classes:

def classification_report_to_dataframe(str_representation_of_report):
    split_string = [x.split(' ') for x in str_representation_of_report.split('\n')]
    column_names = ['']+[x for x in split_string[0] if x!='']
    values = []
    for table_row in split_string[1:-1]:
        table_row = [value for value in table_row if value!='']
        if table_row!=[]:
            values.append(table_row)
    for i in values:
        for j in range(len(i)):
            if i[1] == 'avg':
                i[0:2] = [' '.join(i[0:2])]
            if len(i) == 3:
                i.insert(1,np.nan)
                i.insert(2, np.nan)
            else:
                pass
    report_to_df = pd.DataFrame(data=values, columns=column_names)
    return report_to_df

The output for a test classification report may be found here

score -2 · Answer 19 · answered Sep 23 '16 at 15:53

The way I have always solved output problems is like what I've mentioned in my previous comment, I've converted my output to a DataFrame. Not only is it incredibly easy to send to files (see here), but Pandas is really easy to manipulate the data structure. The other way I have solved this is writing the output line-by-line using CSV and specifically using writerow.

If you manage to get the output into a dataframe it would be

dataframe_name_here.to_csv()

or if using CSV it would be something like the example they provide in the CSV link.

thanks I have tried to use a data frame; `Result = metrics.classification_report(y_test, y_pred_class); df = pd.DataFrame(Result); df.to_csv(results.csv, sep='\t')` but got an error _pandas.core.common.PandasError: DataFrame constructor not properly called!_ — Seun AJAO, Sep 23 '16 at 18:49

scikit learn output metrics.classification_report into CSV/tab-delimited format

19 Answers19

Linked