I want to parse a page that I fetch with plain urllib2 and .read(). Part of the output is the HTML below:
<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
<tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
<br>
<hr>
To extract meaningful information from the HTML above, I am trying to use HTMLParser. (Other options I found while researching, such as BeautifulSoup, lxml and pyquery, are not available in my environment, and I don't have sudo to install them.)
The expected output is a file with one header/value pair per line, separated by a delimiter, say a comma:
Running Map Tasks,31
Running Reduce Tasks,0
Total Submissions,5587
Nodes,8
Occupied Map Slots,31
Occupied Reduce Slots,0
Reserved Map Slots,0
Reserved Reduce Slots,0
Map Task Capacity,352
Reduce Task Capacity,128
Avg. Tasks/Node,60.00
Blacklisted Nodes,0
Excluded Nodes,0
MapTask Prefetch Capacity,0
Any pointers on how to do this with HTMLParser would be appreciated.
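To make the goal concrete, here is a rough sketch of the kind of HTMLParser subclass I have in mind. It is not a verified solution: the names `TableParser` and `to_csv_lines` are made up for illustration, and it assumes the page has exactly one header row of `<th>` cells and one data row of `<td>` cells, as in the sample above. The import fallback is only there so the same code runs on Python 2 and Python 3.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TableParser(HTMLParser):
    """Collect the text of every <th> and <td> cell, in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.headers = []   # text of each <th>
        self.values = []    # text of each <td>
        self._cell = None   # 'th' or 'td' while inside a cell, else None
        self._text = []     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        # Text nested inside <a> within a cell is collected as well.
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._cell is not None and tag == self._cell:
            text = ''.join(self._text).strip()
            if tag == 'th':
                self.headers.append(text)
            else:
                self.values.append(text)
            self._cell = None


def to_csv_lines(html, delimiter=','):
    """Pair each header with the value in the same position."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [h + delimiter + v for h, v in zip(parser.headers, parser.values)]
```

Calling `to_csv_lines(page)` on the string returned by `.read()` and writing `'\n'.join(...)` to a file should then give the layout shown above, as long as the one-header-row, one-data-row assumption holds.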
Update: If I go for other options like BeautifulSoup, will they let me limit the search to a particular block, e.g. Cluster Summary? My HTML has several different sections.
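For comparison, restricting the search to one block also looks doable with HTMLParser alone. The following is only a sketch under assumptions: the class name `SectionTableParser` is hypothetical, and it assumes each section title sits in an `<h2>` directly before that section's table, as in the sample above. Cells are collected only between a matching `<h2>` and the next `</table>`.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class SectionTableParser(HTMLParser):
    """Collect <th>/<td> text only from the table that follows a given <h2>."""

    def __init__(self, heading):
        HTMLParser.__init__(self)
        self.heading = heading  # prefix of the <h2> text to look for
        self.headers = []
        self.values = []
        self._in_h2 = False     # currently inside an <h2>
        self._active = False    # inside the wanted section
        self._cell = None       # 'th' or 'td' while inside a wanted cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_h2 = True
            self._text = []
        elif self._active and tag in ('th', 'td'):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._in_h2 or self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self._in_h2:
            # Activate only if this heading starts with the wanted title.
            self._in_h2 = False
            title = ''.join(self._text).strip()
            self._active = title.startswith(self.heading)
        elif tag == 'table':
            # The section's table has ended; stop collecting.
            self._active = False
        elif self._cell is not None and tag == self._cell:
            text = ''.join(self._text).strip()
            if tag == 'th':
                self.headers.append(text)
            else:
                self.values.append(text)
            self._cell = None
```

Usage would be along the lines of `p = SectionTableParser('Cluster Summary'); p.feed(page); p.close()`, after which `p.headers` and `p.values` hold only that section's cells.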