52

I'm doing multiclass text classification in scikit-learn. The dataset has hundreds of labels, and I'm training a Multinomial Naive Bayes classifier on it. Here's an extract from the script that fits the MNB model:

from __future__ import print_function

# Read file.csv into a pandas DataFrame

import pandas as pd
path = 'data/file.csv'
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
merged = pd.read_csv(path, on_bad_lines='skip', low_memory=False)

# define X and y using the original DataFrame
X = merged.text
y = merged.grid

# split X and y into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)

# make class predictions
y_pred_class = nb.predict(X_test_dtm)

# generate classification report
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred_class))

A simplified version of the metrics.classification_report output on the command line looks like this:

            precision    recall  f1-score   support
         12      0.84      0.48      0.61      2843
         13      0.00      0.00      0.00        69
         15      1.00      0.19      0.32       232
         16      0.75      0.02      0.05       965
         33      1.00      0.04      0.07       155
          4      0.59      0.34      0.43      5600
         41      0.63      0.49      0.55      6218
         42      0.00      0.00      0.00       102
         49      0.00      0.00      0.00        11
          5      0.90      0.06      0.12      2010
         50      0.00      0.00      0.00         5
         51      0.96      0.07      0.13      1267
         58      1.00      0.01      0.02       180
         59      0.37      0.80      0.51      8127
          7      0.91      0.05      0.10       579
          8      0.50      0.56      0.53      7555
  avg/total      0.59      0.48      0.45     35919

Is there any way to get the report output into a standard CSV file with regular column headers?

When I redirect the command-line output to a CSV file, or copy and paste the screen output into a spreadsheet (OpenOffice Calc or Excel), all the results end up lumped into a single column.

desertnaut
Seun AJAO
  • 1
    I'll try to recreate the results as I type this, but have you tried turning the table into a DataFrame using pandas and then sending the DataFrame to CSV using `dataframe_name_here.to_csv()`? Could you also show the code in which you write the results to the CSV? – MattR Sep 23 '16 at 13:58
  • 1
    @MattR I have edited the question and provided the full Python code. I was passing the output of the script to a CSV file from the Linux command line like this: `$ python3 script.py > result.csv` – Seun AJAO Sep 23 '16 at 15:14

19 Answers

123

As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas DataFrame is to have the report returned as a dict:

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred, output_dict=True)

and then construct a DataFrame and transpose it:

import pandas

df = pandas.DataFrame(report).transpose()

From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).

See the documentation.
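
A minimal end-to-end sketch of this approach. The label lists below are made up for illustration (they are not the question's data), and the `index_label="class"` column name is only a suggestion:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels for illustration only; substitute your own y_test / y_pred.
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# output_dict=True (scikit-learn >= 0.20) returns a nested dict instead of a string.
report = classification_report(y_test, y_pred, output_dict=True)

# The outer dict keys become columns, so transpose to get one row per class.
df = pd.DataFrame(report).transpose()

# Without a path, to_csv returns the CSV text; pass a file name as the
# first argument (e.g. "report.csv") to write it to disk instead.
csv_text = df.to_csv(index_label="class")
print(csv_text)
```

Each metric ends up in its own column, so a spreadsheet program should open the resulting file with one value per cell.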

desertnaut
janus235
  • 8
    df.to_csv('file_name.csv') for the lazy :) – Prashant Saraswat Nov 11 '20 at 16:26
  • 1
    Perfect answer. Minor note: since the output dict accuracy has only one value, it will be repeated in the accuracy row of your dataframe. If you want your export to mirror the sklearn output exactly, you can use the snippet below. `report.update({"accuracy": {"precision": None, "recall": None, "f1-score": report["accuracy"], "support": report['macro avg']['support']}})` – piedpiper Feb 25 '22 at 23:52
  • This answer should be accepted! – Say OL Jul 15 '23 at 06:58
21

If you want the individual scores, this should do the job just fine.

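import pandas as pd
from sklearn.metrics import classification_report

def classification_report_csv(report):
    report_data = []
    lines = report.split('\n')
    # Skip the two header lines and the trailing summary lines
    for line in lines[2:-3]:
        # Split on any run of whitespace; splitting on a fixed number of
        # spaces raises ValueError when the column spacing differs
        row_data = line.split()
        if len(row_data) != 5:
            # Not a "label precision recall f1 support" row (e.g. a blank line)
            continue
        row = {}
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    dataframe = pd.DataFrame(report_data)
    dataframe.to_csv('classification_report.csv', index = False)

report = classification_report(y_true, y_pred)
classification_report_csv(report)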
kindjacket
  • 4
    row['precision'] = float(row_data[1]) ValueError: could not convert string to float: – user3806649 Dec 29 '17 at 22:17
  • 3
    change line ```row_data = line.split(' ')``` by ```row_data = line.split(' ') row_data = list(filter(None, row_data))``` – RomaneG Jul 20 '18 at 15:27
  • Really cool ,and thanks~ And I make a comment for the split statement: row_data = line.split(' ') , this one should be better like this : row_data = line.split(), because some time the space number in the report string is not equal – Ting Jia Mar 15 '19 at 08:13
  • Better to replace ```row_data = line.split(' ')``` with ```row_data = ' '.join(line.split()) row_data = row_data.split(' ')``` to account for irregular spaces. – Satheesh K Dec 18 '19 at 13:41
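
The spacing problems raised in these comments can also be sidestepped by splitting each line from the right, which keeps labels that contain spaces in one piece. This is only a sketch, assuming the text layout produced by recent scikit-learn versions; `parse_text_report` is a hypothetical helper name and the toy labels are made up:

```python
import pandas as pd
from sklearn.metrics import classification_report


def parse_text_report(report):
    """Parse the plain-text report into a DataFrame (hypothetical helper)."""
    rows = []
    for line in report.splitlines()[1:]:  # skip the header line
        # Split from the right so a label containing spaces stays intact.
        parts = line.rsplit(maxsplit=4)
        if len(parts) != 5:  # blank lines and the shorter "accuracy" row
            continue
        label, precision, recall, f1, support = parts
        rows.append({
            "class": label.strip(),
            "precision": float(precision),
            "recall": float(recall),
            "f1_score": float(f1),
            "support": int(support),
        })
    return pd.DataFrame(rows)


# Toy data with two-word class names, for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
report = classification_report(y_true, y_pred, target_names=["class a", "class b"])
df = parse_text_report(report)
print(df)
```

Note that the parsed values are the rounded figures printed in the report, so this route loses precision compared with working from the raw numbers.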
14

Just import pandas as pd and set the output_dict parameter, which defaults to False, to True when computing the classification_report. This returns the report as a dictionary, which you can pass to the pandas DataFrame constructor. You may want to transpose the resulting DataFrame to fit the output format that you want. The resulting DataFrame can then be written to a CSV file as you wish.

clsf_report = pd.DataFrame(classification_report(y_true = your_y_true, y_pred = your_y_preds, output_dict=True)).transpose()
clsf_report.to_csv('Your Classification Report Name.csv', index= True)
desertnaut
Samuel Nde
10

We can get the actual values from the precision_recall_fscore_support function and then put them into a DataFrame. The code below gives the same per-class result, but now in a pandas DataFrame:

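import pandas as pd
from sklearn import metrics

# true / pred are the true and predicted labels; nb is the fitted classifier
clf_rep = metrics.precision_recall_fscore_support(true, pred)
out_dict = {
    "precision": clf_rep[0].round(2),
    "recall": clf_rep[1].round(2),
    "f1-score": clf_rep[2].round(2),
    "support": clf_rep[3],
}
out_df = pd.DataFrame(out_dict, index=nb.classes_)
# Summary row: plain (unweighted) mean of each metric, sum of the support
avg_tot = (out_df.apply(lambda x: round(x.mean(), 2) if x.name != "support" else round(x.sum(), 2)).to_frame().T)
avg_tot.index = ["avg/total"]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
out_df = pd.concat([out_df, avg_tot])
print(out_df)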
desertnaut
pka32
6

While the previous answers probably all work, I found them a bit verbose. The following stores the individual class results as well as the summary line in a single DataFrame. It is not very robust to changes in the report format, but it did the trick for me.

#init snippet and fake data
from io import StringIO
import re
import pandas as pd
from sklearn import metrics
true_label = [1,1,2,2,3,3]
pred_label = [1,2,2,3,3,1]

def report_to_df(report):
    # Collapse runs of spaces and rejoin the "avg / total" label of the older report layout
    report = re.sub(r" +", " ", report).replace("avg / total", "avg/total").replace("\n ", "\n")
    report_df = pd.read_csv(StringIO("Classes" + report), sep=' ', index_col=0)
    return report_df

#txt report to df
report = metrics.classification_report(true_label, pred_label)
report_df = report_to_df(report)

#store, print, copy...
print (report_df)

Which gives the desired output:

Classes    precision  recall  f1-score  support
1                0.5     0.5       0.5        2
2                0.5     0.5       0.5        2
3                0.5     0.5       0.5        2
avg/total        0.5     0.5       0.5        6
Kam Sen
6

It's obviously a better idea to just output the classification report as dict:

sklearn.metrics.classification_report(y_true, y_pred, output_dict=True)

But here's a function I made to convert the per-class results (classes only) to a pandas DataFrame:

import pandas as pd

def report_to_df(report):
    report = [x.split(' ') for x in report.split('\n')]
    header = ['Class Name']+[x for x in report[0] if x!='']
    values = []
    # report[1:-5] drops the header line and the trailing summary lines
    for row in report[1:-5]:
        row = [value for value in row if value!='']
        if row!=[]:
            values.append(row)
    df = pd.DataFrame(data = values, columns = header)
    return df
desertnaut
Yash Nag
4

As mentioned in another answer here, precision_recall_fscore_support is analogous to classification_report.

Then it suffices to use pandas to easily format the data in a columnar format, similar to what classification_report does. Here is an example:

import numpy as np
import pandas as pd

from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support

np.random.seed(0)

y_true = np.array([0]*400 + [1]*600)
y_pred = np.random.randint(2, size=1000)

def pandas_classification_report(y_true, y_pred):
    metrics_summary = precision_recall_fscore_support(
            y_true=y_true, 
            y_pred=y_pred)
    
    avg = list(precision_recall_fscore_support(
            y_true=y_true, 
            y_pred=y_pred,
            average='weighted'))

    metrics_sum_index = ['precision', 'recall', 'f1-score', 'support']
    class_report_df = pd.DataFrame(
        list(metrics_summary),
        index=metrics_sum_index)
    
    support = class_report_df.loc['support']
    total = support.sum() 
    avg[-1] = total
    
    class_report_df['avg / total'] = avg

    return class_report_df.T

With classification_report you'll get something like:

print(classification_report(y_true=y_true, y_pred=y_pred, digits=6))

Output:

             precision    recall  f1-score   support

          0   0.379032  0.470000  0.419643       400
          1   0.579365  0.486667  0.528986       600

avg / total   0.499232  0.480000  0.485248      1000

Then with our custom function pandas_classification_report:

df_class_report = pandas_classification_report(y_true=y_true, y_pred=y_pred)
print(df_class_report)

Output:

             precision    recall  f1-score  support
0             0.379032  0.470000  0.419643    400.0
1             0.579365  0.486667  0.528986    600.0
avg / total   0.499232  0.480000  0.485248   1000.0

Then just save it in CSV format (refer to here for other separator formatting, such as sep=';'):

df_class_report.to_csv('my_csv_file.csv',  sep=',')

I opened my_csv_file.csv with LibreOffice Calc (although you could use any spreadsheet editor, such as Excel).

desertnaut
Raul
  • 1
    The averages calculated by classification_report are weighted with the support values. – Flynamic May 10 '18 at 20:32
  • 1
    So it should be `avg = (class_report_df.loc[metrics_sum_index[:-1]] * class_report_df.loc[metrics_sum_index[-1]]).sum(axis=1) / total` – Flynamic May 10 '18 at 21:19
  • 2
    Nice catch @Flynamic! I figured it out that `precision_recall_fscore_support` has an `average` param. which does just what you suggest! – Raul May 11 '18 at 19:09
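
The point made in these comments can be checked numerically: the support-weighted mean of the per-class metrics should match what `average='weighted'` returns. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels for illustration only: 4 samples of class 0, 6 of class 1.
y_true = np.array([0] * 4 + [1] * 6)
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0])

# Per-class metrics: four arrays with one entry per class.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

# Support-weighted mean of each metric, computed by hand.
manual = [np.average(metric, weights=support) for metric in (precision, recall, f1)]

# The same averages as computed by scikit-learn (the fourth element is None).
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")[:3]

print(manual)
print(weighted)
```

The two printed triples should agree up to floating-point error, which is why the `average` parameter can stand in for the manual formula.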
3

I also found some of the answers a bit verbose. Here is my three line solution, using precision_recall_fscore_support as others have suggested.

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

report = pd.DataFrame(list(precision_recall_fscore_support(y_true, y_pred)),
            index=['Precision', 'Recall', 'F1-score', 'Support']).T

# Now add the 'Avg/Total' row (weighted averages, then the total support)
avg = precision_recall_fscore_support(y_true, y_pred, average='weighted')
report.loc['Avg/Total', :] = list(avg[:3]) + [report['Support'].sum()]
elz
  • This works, but trying to use the `labels` parameter of `precision_recall_fscore_support` raises, for some reason, `ValueError: y contains previously unseen labels` – Jack Fleeting Feb 18 '19 at 03:18
3

The simplest and best way I found is:

classes = ['class 1','class 2','class 3']

report = classification_report(Y[test], Y_pred, target_names=classes)

report_path = "report.txt"

# The with-block closes the file even if the write fails
with open(report_path, "w") as text_file:
    text_file.write(report)
2

Another option is to calculate the underlying data and compose the report on your own. You can get all the statistics from:

precision_recall_fscore_support
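
One way this could look in practice is sketched below. The labels are made up, the name `own_report` is hypothetical, and passing `labels` explicitly is just one way to fix the row order:

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Toy labels for illustration only; replace with your own data.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
labels = ["a", "b", "c"]  # fixes the order of the per-class results

# Without an `average` argument, each returned array has one entry per label.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)

own_report = pd.DataFrame(
    {"precision": precision, "recall": recall, "f1-score": f1, "support": support},
    index=labels,
)

# Round only for display; pass a file name to to_csv to write to disk.
csv_text = own_report.round(2).to_csv(index_label="class")
print(csv_text)
```

Working from the raw arrays keeps full precision and leaves the choice of summary rows (weighted, macro, none) up to you.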
Karel Macek
2

Here is another function, metrics_report_to_df(), along with example input and output. It uses precision_recall_fscore_support from sklearn.metrics:

# Generates classification metrics using precision_recall_fscore_support:
from sklearn import metrics
import pandas as pd
import numpy as np

# Simulating true and predicted labels as test dataset: 
np.random.seed(10)
y_true = np.array([0]*300 + [1]*700)
y_pred = np.random.randint(2, size=1000)

# Here's the custom function returning classification report dataframe:
def metrics_report_to_df(ytrue, ypred):
    precision, recall, fscore, support = metrics.precision_recall_fscore_support(ytrue, ypred)
    classification_report = pd.concat(map(pd.DataFrame, [precision, recall, fscore, support]), axis=1)
    classification_report.columns = ["precision", "recall", "f1-score", "support"]
    # Add the "avg/Total" row: weighted averages plus the total support
    avg = metrics.precision_recall_fscore_support(ytrue, ypred, average='weighted')
    classification_report.loc['avg/Total', :] = list(avg[:3]) + [classification_report['support'].sum()]
    return classification_report

# Provide input as true_label and predicted label (from classifier)
classification_report = metrics_report_to_df(y_true, y_pred)

# Here's the output (metrics report transformed to dataframe )
In [1047]: classification_report
Out[1047]: 
           precision    recall  f1-score  support
0           0.300578  0.520000  0.380952    300.0
1           0.700624  0.481429  0.570703    700.0
avg/Total   0.580610  0.493000  0.513778   1000.0
Surya
1

I have modified @kindjacket's answer. Try this:

import collections

import pandas as pd

def classification_report_df(report):
    report_data = []
    lines = report.split('\n')
    del lines[-5]
    del lines[-1]
    del lines[1]
    for line in lines[1:]:
        row = collections.OrderedDict()
        row_data = line.split()
        row_data = list(filter(None, row_data))
        row['class'] = row_data[0] + " " + row_data[1]
        row['precision'] = float(row_data[2])
        row['recall'] = float(row_data[3])
        row['f1_score'] = float(row_data[4])
        row['support'] = int(row_data[5])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    df.set_index('class', inplace=True)
    return df

You can then export that DataFrame to CSV using pandas.

Kevin
1

The function below returns the classification report as a pandas DataFrame, which can then be dumped to a CSV file. The resulting CSV file will look exactly like the printed classification report.

import pandas as pd
from sklearn import metrics


def classification_report_df(y_true, y_pred):
    report = metrics.classification_report(y_true, y_pred, output_dict=True)
    df_report = pd.DataFrame(report).transpose()
    df_report = df_report.round(3)  # round() returns a new DataFrame, so reassign it
    df_report = df_report.astype({'support': int})    
    df_report.loc['accuracy',['precision','recall','support']] = [None,None,df_report.loc['macro avg']['support']]
    return df_report


report = classification_report_df(y_true, y_pred)
report.to_csv("<Full Path to Save CSV>")
KaranKakwani
0
import numpy as np

def to_table(report):
    report = report.splitlines()
    res = []
    res.append(['']+report[0].split())
    for row in report[2:-2]:
       res.append(row.split())
    lr = report[-1].split()
    res.append([' '.join(lr[:3])]+lr[3:])
    return np.array(res)

This returns a NumPy array, which can be turned into a pandas DataFrame or simply saved as a CSV file.

Sipan17
0

This is my code for a two-class (pos, neg) classification:

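import csv
from sklearn import metrics

# true_labels, predicted_labels and classes (the pos/neg labels) come from your own code
report = metrics.precision_recall_fscore_support(true_labels,predicted_labels,labels=classes)
rowDicionary = {}
rowDicionary["precision_pos"] = report[0][0]
rowDicionary["recall_pos"] = report[1][0]
rowDicionary["f1-score_pos"] = report[2][0]
rowDicionary["support_pos"] = report[3][0]
rowDicionary["precision_neg"] = report[0][1]
rowDicionary["recall_neg"] = report[1][1]
rowDicionary["f1-score_neg"] = report[2][1]
rowDicionary["support_neg"] = report[3][1]
fieldnames = list(rowDicionary.keys())
# 'report.csv' is a placeholder file name; the header row is an addition for readability
with open('report.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(rowDicionary)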
Piotr Badura
0

I wrote the code below to extract the classification report and save it to an Excel file:

import os
from collections import defaultdict

import pandas as pd

def classifcation_report_processing(model_to_report):
    tmp = list()
    for row in model_to_report.split("\n"):
        parsed_row = [x for x in row.split("  ") if len(x) > 0]
        if len(parsed_row) > 0:
            tmp.append(parsed_row)

    # Store in dictionary
    measures = tmp[0]

    D_class_data = defaultdict(dict)
    for row in tmp[1:]:
        class_label = row[0]
        for j, m in enumerate(measures):
            D_class_data[class_label][m.strip()] = float(row[j + 1].strip())
    save_report = pd.DataFrame.from_dict(D_class_data).T
    path_to_save = os.getcwd() +'/Classification_report.xlsx'
    save_report.to_excel(path_to_save, index=True)
    return save_report.head(5)

To call the function, the line below can be used anywhere in the program:

saving_CL_report_naive_bayes = classifcation_report_processing(classification_report(y_val, prediction))


DeshDeep Singh
0

I had the same problem. What I did was paste the string output of metrics.classification_report into Google Sheets or Excel and split the text into columns, using a custom separator of five spaces.

0

Definitely worth using:

sklearn.metrics.classification_report(y_true, y_pred, output_dict=True)

But a slightly revised version of the function by Yash Nag is as follows. The function includes the accuracy, macro avg and weighted avg rows along with the classes:

import numpy as np
import pandas as pd

def classification_report_to_dataframe(str_representation_of_report):
    split_string = [x.split(' ') for x in str_representation_of_report.split('\n')]
    column_names = ['']+[x for x in split_string[0] if x!='']
    values = []
    for table_row in split_string[1:-1]:
        table_row = [value for value in table_row if value!='']
        if table_row!=[]:
            values.append(table_row)
    for i in values:
        # "macro avg" / "weighted avg": rejoin the two-word label
        if i[1] == 'avg':
            i[0:2] = [' '.join(i[0:2])]
        # "accuracy" row: pad the missing precision and recall columns
        if len(i) == 3:
            i.insert(1, np.nan)
            i.insert(2, np.nan)
    report_to_df = pd.DataFrame(data=values, columns=column_names)
    return report_to_df

The output for a test classification report may be found here

-2

The way I have always solved output problems is, as I mentioned in my earlier comment, to convert the output to a DataFrame. Not only is it incredibly easy to send to files (see here), but pandas also makes the data structure really easy to manipulate. The other way I have solved this is by writing the output line by line with the csv module, specifically using writerow.

If you manage to get the output into a dataframe it would be

dataframe_name_here.to_csv()

or if using CSV it would be something like the example they provide in the CSV link.
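
For the line-by-line route, a rough sketch using the standard csv module is shown below. It assumes scikit-learn 0.20 or later, since it relies on the `output_dict=True` option described in another answer; the labels are made up, and the in-memory buffer stands in for a real file:

```python
import csv
import io

from sklearn.metrics import classification_report

# Toy labels for illustration only.
y_test = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
report = classification_report(y_test, y_pred, output_dict=True)

columns = ["precision", "recall", "f1-score", "support"]

# Stand-in for: open("report.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["class"] + columns)  # header row

for label, scores in report.items():
    if not isinstance(scores, dict):  # "accuracy" is a single float, not a dict
        continue
    writer.writerow([label] + [scores[c] for c in columns])

print(buffer.getvalue())
```

Each `writerow` call emits one comma-separated line, so the spreadsheet sees a separate cell per metric instead of one lumped column.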

MattR
  • 2
    thanks I have tried to use a data frame; `Result = metrics.classification_report(y_test, y_pred_class); df = pd.DataFrame(Result); df.to_csv(results.csv, sep='\t')` but got an error _pandas.core.common.PandasError: DataFrame constructor not properly called!_ – Seun AJAO Sep 23 '16 at 18:49