parsing URL to generate delimited file

Question

I want to parse the page behind a URL. Fetching it with plain urllib2 and .read() returns, in part, the following HTML:

<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
< table border="1" cellpadding="5" cellspacing="0">
< tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
< tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
< br>
< hr>

Now, to get meaningful information from the above HTML, I am trying to use HTMLParser (the other options my research turned up, such as BeautifulSoup, lxml, and pyquery, are not available in my environment, and I don't have sudo to install them).

The expected output is a file with some delimiter, say a comma:

Running Map Tasks,31
Running Reduce Tasks,0
Total Submissions,5587
Nodes,8
Occupied Map Slots,31
Occupied Reduce Slots,0 
Reserved Map Slots,0 
Reserved Reduce Slots,0 
Map Task Capacity,352 
Reduce Task Capacity,128 
Avg. Tasks/Node,60.00 
Blacklisted Nodes,0 
Excluded Nodes,0 
MapTask Prefetch Capacity,0

Please check

Update: If I go for other options like BeautifulSoup, will they let me limit the search to a particular block, e.g. Cluster Summary? My HTML has several different sections.

No need for sudo to install things locally. — Daniel Roseman, Feb 16 '15 at 22:03
OK, but I am not able to write to the lib directory while installing. — itsavy, Feb 16 '15 at 22:10

Jasper · Answer 1 · 2015-02-17T08:59:18.287

If you really have to/want to do this "by hand", you can resort to string manipulation and/or regular expressions. Assuming that you found the relevant lines by iterating over the HTML:

import re

line1 = "< tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>"

line2 = '< tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>'


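# Strip the row tags and the opening cell tags, then split on the closing cell tag.
# A trailing empty string remains after the last </th>; zip() below drops it.
headers = (line1.replace('< tr>', '').
  replace("</tr>", "").
  replace("<th>", "").
  split("</th>"))

# Grab the content of every <td> cell.
matches = re.findall("<td>(.+?)</td>", line2)
clean_matches = []
for m in matches:
    if m.startswith('<'):
        # The value is wrapped in a tag such as <a href=...>: take the first text node.
        clean_matches.append(re.search('>(.+?)<', m).group(1))
    else:
        clean_matches.append(m)

for h, m in zip(headers, clean_matches):
    print("{}: {}".format(h, m))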

For the first line, I'm using .replace() and .split() to get rid of the tags and split at the right places. For the "data" line, I'm using regular expressions to get the content of each <td>. If the content starts with a tag, the regex searches for the first text node in the <td>. As always with this approach, the code is quite brittle and can easily break if the server formats the output even slightly differently.
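Since the question mentions HTMLParser, here is a minimal sketch of the same extraction using only the standard library. It is not a drop-in solution: the class name SectionTableParser and the helper parse_section are made up for illustration, and the sketch assumes the section you want starts with an <h2> whose text begins with the given title and ends at the next </table>. It reacts only to <h2>, <th>, <td> and </table>, so it should not matter whether the page really contains "< tr>" with a space.

```python
try:
    from html.parser import HTMLParser      # Python 3
except ImportError:
    from HTMLParser import HTMLParser       # Python 2


class SectionTableParser(HTMLParser):
    """Collect <th>/<td> text from the first table after a matching <h2>."""

    def __init__(self, section_title):
        HTMLParser.__init__(self)           # old-style base class on Python 2
        self.section_title = section_title
        self.headers = []
        self.values = []
        self._in_h2 = False
        self._in_section = False
        self._done = False
        self._cell = None                   # "th" or "td" while inside a cell
        self._h2_text = []
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._h2_text = []
        elif tag in ("th", "td") and self._in_section and not self._done:
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._in_h2:
            self._h2_text.append(data)
        elif self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
            if not self._done:
                title = "".join(self._h2_text).strip()
                self._in_section = title.startswith(self.section_title)
        elif tag == self._cell:
            target = self.headers if tag == "th" else self.values
            target.append("".join(self._text).strip())
            self._cell = None
        elif tag == "table" and self._in_section:
            self._done = True               # stop after the section's table


def parse_section(html, section_title):
    """Return (headers, values) for the table under the given heading."""
    parser = SectionTableParser(section_title)
    parser.feed(html)
    parser.close()
    return parser.headers, parser.values


# Shortened sample in the shape of the question's HTML, plus two other sections.
SAMPLE_HTML = (
    '<h2>Other Section</h2>'
    '< table>< tr><th>Ignored</th></tr>< tr><td>x</td></tr></table>'
    '<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>'
    '< table border="1" cellpadding="5" cellspacing="0">'
    '< tr><th>Running Map Tasks</th><th>Nodes</th><th>Avg. Tasks/Node</th></tr>'
    '< tr><td>1</td><td><a href="machines.jsp?type=active">8</a></td>'
    '<td>60.00</td></tr></table>'
    '<h2>Later Section</h2>'
    '< table>< tr><th>Also ignored</th></tr></table>'
)

headers, values = parse_section(SAMPLE_HTML, "Cluster Summary")
for h, v in zip(headers, values):
    print("%s,%s" % (h, v))
```

Matching on the heading text is what limits the search to one block; tables under other headings are skipped.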

If you get a "zero length field name" ValueError, you are using an old Python version: automatic field numbering with a bare {} was only added in Python 2.7 and 3.1. Assuming that you can't change this, you have to change the print function/statement to either use explicit indices, {0}: {1}, or print "%s: %s" % (h, m).
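As a quick illustration of the two alternatives (the values of h and m here are just placeholders taken from the sample table):

```python
h, m = "Nodes", "8"

# Explicit field indices are accepted by Python 2.6 as well as newer versions.
line_a = "{0}: {1}".format(h, m)

# Percent-style formatting also works on old and new versions alike.
line_b = "%s: %s" % (h, m)

print(line_a)
print(line_b)
```

Both lines produce the same string, so either can replace the "{}: {}" call in the loop.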

See also the Python documentation on string formatting.
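The question asks for a comma-delimited file rather than printed output. One possible way to get there, sketched with the standard csv module: the helper name write_pairs is hypothetical, the sample lists are a shortened stand-in for the parsed headers and values, and the demo writes to an in-memory buffer and assumes Python 3 (on Python 2 you would pass a real file opened in "wb" mode instead of io.StringIO).

```python
import csv
import io


def write_pairs(headers, values, fileobj, delimiter=","):
    """Write one 'header<delimiter>value' row per column to fileobj."""
    writer = csv.writer(fileobj, delimiter=delimiter, lineterminator="\n")
    # zip() stops at the shorter list, so a trailing empty header is dropped.
    for h, v in zip(headers, values):
        writer.writerow([h.strip(), v.strip()])


# Shortened sample data; the empty string mimics the leftover from split().
headers = ["Running Map Tasks", "Running Reduce Tasks", "Nodes", ""]
values = ["1", "0", "8"]

buf = io.StringIO()
write_pairs(headers, values, buf)
print(buf.getvalue())
```

To write an actual file on Python 3, something like open("cluster_summary.csv", "w", newline="") could be passed instead of the buffer; the file name is only an example.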

Thanks Jasper. print("{}: {}".format(h, m)) gives "ValueError: zero length field name in format"; working on this. — itsavy, Feb 17 '15 at 01:08
You seem to be using Python < 2.7; see http://stackoverflow.com/questions/5446964/valueerror-zero-length-field-name-in-format-error-in-python-3-0-3-1-3-2. I'll update the answer. — Jasper, Feb 17 '15 at 08:48
