Many Python HTML parsers, such as lxml and pyquery, don't know whether an element is display:none or not, so the .text() function often returns hidden text, comments, or the contents of <script> tags, among other bugs.
Take the HTML below, for example:
<tr class="" rel="21261811">
<td class="leftborder timestamp" rel="1451972762"><span class="updatets ">
14secs</span></td>
<td><span><style>
.yc_S{display:none}
.qDiU{display:inline}
.VXrU{display:none}
.JoAO{display:inline}
.DEIS{display:none}
.p6YU{display:inline}
</style><span style="display:none">132</span><div style="display:none">132</div><span class="95">189</span><span class="yc_S">243</span><div style="display:none">243</div>.<span class="DEIS">70</span><div style="display:none">70</div><span></span><span class="VXrU">208</span><span></span><span style="display: inline">219</span><span style="display: inline">.</span><span style="display:none">131</span><span class="VXrU">131</span><span style="display: inline">210</span><span class="115">.</span><span style="display: inline">153</span></span></td>
<td>
10000</td>
<td style="text-align:left" rel="mx"><span class="country" style="white-space:nowrap;">
<img src=" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-mx" alt="flag ">Mexico</span>
</td>
<td> <div class="progress-indicator" style="width: 114px" value="1839" levels="speed" rel="1839">
<div class="indicator" style="width: 82%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td>
<div class="progress-indicator " style="width: 114px" title="" rel="271" value="271" levels="speed">
<div class="indicator" style="width: 95%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td>socks4/5</td>
<td nowrap="">High +KA</td>
</tr>
The output of $('tr').text() should be an IP address, but lxml and pyquery return an ugly result that includes the hidden text. (Actually, $('tr').text() in jQuery cannot get the IP either; it needs some jQuery basic filters, which lxml and pyquery don't support.)
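To make the expected behaviour concrete, here is a rough sketch of the kind of filtering I mean. It uses the standard library's html.parser instead of lxml or pyquery, and it only works under narrow assumptions: hidden classes are declared as simple .name{display:none} rules inside a <style> block, and everything else is hidden through an inline style attribute. The names VisibleTextParser and visible_text are made up for this illustration. It is not a general CSS engine.

```python
import re
from html.parser import HTMLParser

# Tags without a closing tag; they must not be pushed on the element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.I)
# Assumption: hidden classes look like ".name{display:none}" and nothing fancier.
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*;?\s*\}", re.I)
STYLE_BLOCK = re.compile(r"<style[^>]*>(.*?)</style>", re.I | re.S)


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a hidden element, <script> or <style>."""

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = hidden_classes
        self.parts = []
        self._stack = []  # (tag, hidden) for every currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        hidden = (
            (bool(self._stack) and self._stack[-1][1])  # inherited from parent
            or tag in ("script", "style")
            or bool(DISPLAY_NONE.search(attrs.get("style") or ""))
            or any(c in self.hidden_classes for c in classes)
        )
        self._stack.append((tag, hidden))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if not (self._stack and self._stack[-1][1]):
            self.parts.append(data)


def visible_text(html):
    # First pass: collect class names that the <style> blocks hide.
    hidden = set()
    for css in STYLE_BLOCK.findall(html):
        hidden.update(HIDDEN_RULE.findall(css))
    # Second pass: walk the markup and keep only visible text.
    parser = VisibleTextParser(hidden)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Shortened version of the IP cell from the row above.
sample = (
    '<td><span><style>.yc_S{display:none}.qDiU{display:inline}</style>'
    '<span style="display:none">132</span>'
    '<span class="95">189</span>'
    '<span class="yc_S">243</span>.'
    '<span style="display: inline">219</span>'
    '<!-- comment --><script>var x = 1;</script>'
    '</span></td>'
)
print(visible_text(sample))  # 189.219
```

Comments are dropped because html.parser ignores them unless handle_comment is overridden. Feeding the whole row would also return the text of the other cells, so the IP cell has to be selected first. Anything hidden through an external stylesheet or by JavaScript is invisible to this approach, which is why a real browser engine still looks like the more general answer.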
For now, I have to use Selenium's built-in selectors to avoid the problem above.
But I'd like to find a definitive way to solve this kind of problem.
The approach I have in mind is to use jQuery to parse the HTML.
After some searching, the only thing I found that can evaluate JavaScript from Python is pyv8.
I'd like to use a PhantomJS broker to fetch the HTML and let jQuery process it.
But pyv8's documentation is so sparse that I can't figure out how to use it.
Any example or other solution is welcome, thank you!