Reading a complicated table with pandas ('task-spooler')

Question

I have the following table, which is the output of task-spooler.

It's easy for humans to parse, but I am having trouble reading it into a pandas DataFrame.

Any idea?

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo
9    finished   /tmp/ts-out.8ewxzl   0        0.01/0.00/0.00 echo
10   finished   /tmp/ts-out.ahSLaY   0        0.00/0.00/0.00 bash -c echo $GPUID
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
12   finished   /tmp/ts-out.ADWkve   0        0.00/0.00/0.00 bash -c ls
13   finished   /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1       130.67/0.00/0.02 bash -c python infloop.py
14   finished   /tmp/ts-out.HxBqkm   0        0.00/0.00/0.00 bash -c echo 11
15   finished   /tmp/ts-out.ERNuaE   0        0.00/0.00/0.00 bash -c echo 
16   finished   /tmp/ts-out.9j6hkS   0        0.00/0.00/0.00 bash -c echo $GPUID
17   finished   /tmp/ts-out.Y5QDNa   0        0.00/0.00/0.00 bash -c echo $GPUID
18   finished   /tmp/ts-out.EIHhoX   -1       0.00/0.00/0.00 %s
19   finished   /tmp/ts-out.LLw2Wl   -1       0.00/0.00/0.00 
20   finished   /tmp/ts-out.deWAJR   -1       0.01/0.00/0.00 echo $GPUID
21   finished   /tmp/ts-out.AdZFIf   -1       0.00/0.00/0.00 echo 12
22   finished   /tmp/ts-out.NBOCVv   0        0.00/0.00/0.00 echo 12
23   finished   /tmp/ts-out.5WpfPu   0        0.00/0.00/0.00 echo
24   finished   /tmp/ts-out.1lw4bS   -1       0.00/0.00/0.00 echo 
25   finished   /tmp/ts-out.7MNGLQ   0        0.00/0.00/0.00 bash -c echo $GPUID
26   finished   /tmp/ts-out.8FZ3on   0        0.00/0.00/0.00 bash -c echo $GPUID

My best try was:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use `from StringIO import StringIO`

std = ...  # the table text
pd.read_table(StringIO(std), sep=r'\s+', engine='python')

Error:

ValueError: Expected 7 fields in line 2, saw 9

EDIT: The source code that generates the table is available. Here are the calls that generate each line. Can this help with reading the table into a DataFrame?

if (p->label)
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]"
            "%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->label,
            p->command);
else
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->command);

@EdChum, no. Using `\t` puts all the columns into a single column — Yuval Atzmon, Feb 13 '17 at 10:25
What about `df = pd.read_csv('file', sep=r'\s{2,}', engine='python')`? The separator is a regex meaning two or more whitespace characters — jezrael, Feb 13 '17 at 10:26
@jezrael, thanks. It reads the first 3 columns correctly, but then gets the remaining columns wrong — Yuval Atzmon, Feb 13 '17 at 10:33

score 0 · Answer 1 · answered Feb 13 '17 at 11:29

It's kind of annoying, but because the separators in the output are inconsistent (sometimes multiple spaces, sometimes tabs, and usually just one space before the last column), the table is hard to parse without preprocessing the file before handing it to pandas. I personally don't like opening the file in Python to fix it and then loading it with pandas, so I would add a short sed command to my pipeline before loading the file in Python (which is very simple if you're on Linux and the log text comes from a file). You can add:

cat logfile.log | sed -r 's/\s\s+/,/g' | sed -e 's/\([[:digit:]]\.[[:digit:]]\{2\}\) /\1,/' > logfile.csv

This replaces every run of two or more whitespace characters with a comma, and then replaces the last, problematic single space (the one after the Times column) with a comma as well. The text then turns from:

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo

To this:

ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2]
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo

And then load it in pandas as CSV:

import pandas as pd
my_df = pd.read_csv('logfile.csv')  # the file written by the sed pipeline above

I am sorry that it's not a fun pure-Python solution, but the bash part makes the Python part much easier, in my opinion.
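For comparison, here is a rough pure-Python sketch that leans on the snprintf format posted in the question instead of guessing at separators. It is only a sketch under a few assumptions: the output path contains no whitespace, rows for running jobs leave the E-Level and Times fields blank (as in the sample), and lines that do not match are silently skipped. The names `LINE_RE`, `parse_ts_table`, and the column names `ELevel`, `real`, `user`, `system` are made up for illustration and do not come from task-spooler.

```python
import re
import pandas as pd

# Regex modeled on the format "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s".
# The E-Level/Times group is optional (assumption: running jobs leave it blank).
LINE_RE = re.compile(
    r"^(?P<ID>\d+)\s+"
    r"(?P<State>\S+)\s+"
    r"(?P<Output>\S+)\s*"
    r"(?:(?P<ELevel>-?\d+)\s+"
    r"(?P<real>[\d.]+)/(?P<user>[\d.]+)/(?P<system>[\d.]+) ?)?"
    r"(?P<Command>.*)$"
)
COLUMNS = ["ID", "State", "Output", "ELevel", "real", "user", "system", "Command"]


def parse_ts_table(text):
    """Parse task-spooler list output into a DataFrame (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.rstrip())
        if match:  # the header and blank lines do not match and are skipped
            rows.append(match.groupdict())
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Convert the numeric columns; fields that were blank become NaN.
    for col in ["ID", "ELevel", "real", "user", "system"]:
        df[col] = pd.to_numeric(df[col])
    return df


# A few rows taken from the table in the question.
sample = """ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
19   finished   /tmp/ts-out.LLw2Wl   -1       0.00/0.00/0.00
"""
df = parse_ts_table(sample)
print(df)
```

Because the command is captured as "everything after the Times field", single spaces inside the command do not split it, and paths longer than the 20-character column width are handled since the Output field is matched as one non-whitespace token.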

Thanks, I am ok with `sed` as long as it works. However, your solution still does not solve the inconsistent spaces issue. See for example your 2nd csv line: ideally there should be additional commas before `[l1]python infloop.py` — Yuval Atzmon, Feb 13 '17 at 11:36
Sorry, for some reason I thought that row misses the last columns and not two columns in the middle. — Moran Neuhof, Feb 13 '17 at 11:46
This works better, by the way: `cat logfile.log | sed -e 's/\s\{2,12\}/,/g' | sed -e 's/\([[:digit:]].[[:digit:]]\{2\}\) /\1,/' > logfile.csv` It takes into consideration that too many spaces actually mean more than one comma. This is the output: `ID,State,Output,,E-Level,Times(r/u/s),Command [run=1/2]` `6,running,/tmp/ts-out.FzVneG,,,[l1]python infloop.py` `0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1` `1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1` — Moran Neuhof, Feb 13 '17 at 11:55
Toda ;) It almost works! There is an additional column in the middle, `Unnamed: 3`, and the results are shifted from that column onward, while the `Command` column is all NaNs: `ID State Output Unnamed: 3 E-Level Times(r/u/s) Command [run=0/2]`. I am not familiar with the `sed` syntax; any idea how to fix that? — Yuval Atzmon, Feb 13 '17 at 12:37
Hmmm... `cat logfile.log | sed -e 's/\s\{2,8\}/,/g' | sed -e 's/\([[:digit:]].[[:digit:]]\{2\}\) /\1,/' | sed -e 's/,\([[:digit:]]\+.[[:digit:]]\{2\}\)/,,\1/' > logfile.csv` Just added another comma before the time column. I'm sorry it's ended up so ugly :) — Moran Neuhof, Feb 13 '17 at 13:12
Thanks, but it still doesn't catch all the different cases. Never mind, I found another (more conventional) way to extract the specific information I need without pandas — Yuval Atzmon, Feb 13 '17 at 13:30
