strsplit() output as a data frame in R

Question

I have some results from a model in Python which I have saved as a .txt file to render in RMarkdown.

The .txt file looks like this:

             precision    recall  f1-score   support

          0       0.71      0.83      0.77      1078
          1       0.76      0.61      0.67       931

avg / total       0.73      0.73      0.72      2009

I read the file into R as:

x <- read.table(file = 'report.txt', fill = T, sep = '\n')

When I do this, R stores the results as one column (V1) instead of 5 columns, as below:

                                                    V1
1              precision    recall  f1-score   support
2           0       0.71      0.83      0.77      1078
3           1       0.76      0.61      0.67       931
4 avg / total       0.73      0.73      0.72      2009

I tried using strsplit() to split the columns, but it doesn't work.

strsplit(as.character(x$V1), split = "|", fixed = T)

Maybe strsplit() is not the right approach? How do I get around this so that I have a [4x5] data frame?

Thanks a lot.

Honestly I think you should go back to your Python script and have it output a format more amenable to `read.table`, which expects some number of columns, with a header for each. Otherwise, you are going to have to do some olympics here in R. – Tim Biegeleisen, Sep 20 '18 at 13:30
Way easier to let Python export its output to a proper CSV file. – Wimpel, Sep 20 '18 at 13:31
If you'd like an R solution, you should be able to do this with matrix(data = x$V1, ncol = 5, byrow = TRUE). Assuming that x$V1 gives the vector of all your data with x$V1[1] == 0, x$V1[2] == 0.71, x$V1[3] == 0.83 and so on, then the matrix command should restructure the data in the desired form (hopefully :D) – TinglTanglBob, Sep 20 '18 at 13:41
I should probably go back to Python as suggested, until I find a cleaner R solution. This is way too complicated in R. – der_radler, Sep 20 '18 at 13:53
I think it shouldn't be. Maybe I got you wrong in my previous post, and now I think this can be solved with the correct specification of sep and header in read.table. Maybe read.table(file = 'report.txt', header = TRUE, sep = '\t') already does the trick? – TinglTanglBob, Sep 20 '18 at 14:09
@TinglTanglBob not really. It's the same output. I fixed this in Python itself. :) – der_radler, Sep 20 '18 at 15:19

score 1 · Answer 1 · answered Sep 20 '18 at 15:34

Not very elegant, but this works. First we read the raw text, then we use regex to clean it up: remove the spaces around the "/", drop empty lines, and collapse runs of whitespace into commas so the text is in a CSV-readable format. Then we read the CSV.

library(stringr)
library(magrittr)
library(purrr)

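text <- str_replace_all(readLines("~/Desktop/test.txt"), "\\s(?=/)|(?<=/)\\s", "") %>%  # "avg / total" -> "avg/total"
  .[which(nchar(.)>0)] %>%                # drop empty lines
  str_split(pattern = "\\s+") %>%         # split each line on runs of whitespace
  map(., ~paste(.x, collapse = ",")) %>%  # rejoin the fields with commas
  unlist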

read.csv(textConnection(text))
#>           precision recall f1.score support
#> 0              0.71   0.83     0.77    1078
#> 1              0.76   0.61     0.67     931
#> avg/total      0.73   0.73     0.72    2009

Created on 2018-09-20 by the reprex package (v0.2.0).
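
For anyone following along on the Python side, here is a rough mirror of the three steps described above (remove the spaces around "/", drop empty lines, collapse whitespace runs into commas). It is only a sketch, not the code from this answer: the function name `report_to_csv_lines` is made up for illustration, and the report text is pasted in from the question rather than read from a file.

```python
import re

# Report text copied from the question; in practice it would be read from the .txt file.
REPORT = """\
             precision    recall  f1-score   support

          0       0.71      0.83      0.77      1078
          1       0.76      0.61      0.67       931

avg / total       0.73      0.73      0.72      2009
"""


def report_to_csv_lines(text):
    """Turn the whitespace-aligned report into comma-separated lines.

    Hypothetical helper: it mirrors the regex pipeline step by step.
    """
    csv_lines = []
    for raw in text.splitlines():
        # Step 1: drop the spaces around "/" so "avg / total" becomes one token,
        # then trim the leading/trailing padding.
        line = re.sub(r"\s(?=/)|(?<=/)\s", "", raw).strip()
        # Step 2: skip empty lines.
        if not line:
            continue
        # Step 3: split on runs of whitespace and rejoin the fields with commas.
        csv_lines.append(",".join(re.split(r"\s+", line)))
    return csv_lines


csv_lines = report_to_csv_lines(REPORT)
for line in csv_lines:
    print(line)
```

Note that the header line ends up with four fields while the data lines have five. According to the `read.table` documentation, a header with one fewer field than the data makes R use the first column as row names, which would explain why the class labels show up as row names in the output above.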

AndS.

Thank you for going through the trouble. Which package is `textConnection()` from? Also, as suggested before, this is a lot of circus to fix in R. I am posting an alternative Python-to-CSV output here. – der_radler Sep 21 '18 at 07:43
Oh I definitely agree that it would be easier to fix initially. I just wanted to provide an alternative. textConnection is a base R function. – AndS. Sep 21 '18 at 12:19

score 0 · Answer 2 · answered Sep 21 '18 at 07:49

Since it is much simpler to have Python output a CSV, I am posting an alternative here, in case it is useful; even in Python it needs some work.

import pandas as pd


def report_to_csv(report, title):
    report_data = []
    lines = report.split('\n')

    # loop through the lines
    for line in lines[2:-3]:
        row = {}
        row_data = line.split('      ')
        row['class'] = row_data[1].strip()  # remove leftover padding around the label
        row['precision'] = float(row_data[2])
        row['recall'] = float(row_data[3])
        row['f1_score'] = float(row_data[4])
        row['support'] = float(row_data[5])
        report_data.append(row)

    df = pd.DataFrame.from_dict(report_data)

    # read the final summary line
    line_data = lines[-2].split('     ')
    summary_dat = []
    row2 = {}
    row2['class'] = line_data[0]
    row2['precision'] = float(line_data[1])
    row2['recall'] = float(line_data[2])
    row2['f1_score'] = float(line_data[3])
    row2['support'] = float(line_data[4])
    summary_dat.append(row2)

    summary_df = pd.DataFrame.from_dict(summary_dat)

    # concatenate both df. 
    report_final = pd.concat([df,summary_df], axis=0)
    report_final.to_csv(title+'cm_report.csv', index = False)

Function inspired by this solution
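
The function above depends on the exact number of spaces between columns, so it can break if the padding changes. Below is a sketch of a variant that does not count spaces: it takes the last four whitespace-separated fields of each line as the metrics and treats whatever comes before them as the label. It assumes the report has the layout shown in the question; the names `parse_report` and `rows_to_csv` are made up for illustration, and only the standard library is used, so no pandas is needed.

```python
import csv
import io

# Report text copied from the question; normally this is the string
# returned by the reporting function.
REPORT = """\
             precision    recall  f1-score   support

          0       0.71      0.83      0.77      1078
          1       0.76      0.61      0.67       931

avg / total       0.73      0.73      0.72      2009
"""

FIELDS = ["class", "precision", "recall", "f1_score", "support"]


def parse_report(report):
    """Parse the text report into a list of dicts, one per data row."""
    rows = []
    for line in report.splitlines():
        # Split from the right: the last four fields are the metrics, and
        # the remainder is the label (which may contain spaces, e.g. "avg / total").
        parts = line.strip().rsplit(None, 4)
        if len(parts) != 5:
            continue  # header line (four fields) or blank line
        label, precision, recall, f1_score, support = parts
        try:
            rows.append({
                "class": label.strip(),
                "precision": float(precision),
                "recall": float(recall),
                "f1_score": float(f1_score),
                "support": int(support),
            })
        except ValueError:
            continue  # not a numeric row, skip it
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_report(REPORT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To write a file instead of a string, the same `csv.DictWriter` calls can be pointed at `open(path, "w", newline="")`. The resulting file has one header per column, so `read.csv()` in R should be able to read it without extra cleanup.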
