How to use real jquery in python to extract data from a web page?

Question

Many python html parser like lxml, pyquery don't know which element is display:none or not, .text() function often get comment or something else in <script>, and so on bugs.

such as below html:

<tr class="" rel="21261811">
         <td class="leftborder timestamp" rel="1451972762"><span class="updatets ">
14secs</span></td>
         <td><span><style>
.yc_S{display:none}
.qDiU{display:inline}
.VXrU{display:none}
.JoAO{display:inline}
.DEIS{display:none}
.p6YU{display:inline}
</style><span style="display:none">132</span><div style="display:none">132</div><span class="95">189</span><span class="yc_S">243</span><div style="display:none">243</div>.<span class="DEIS">70</span><div style="display:none">70</div><span></span><span class="VXrU">208</span><span></span><span style="display: inline">219</span><span style="display: inline">.</span><span style="display:none">131</span><span class="VXrU">131</span><span style="display: inline">210</span><span class="115">.</span><span style="display: inline">153</span></span></td> 
         <td>
10000</td>
             <td style="text-align:left" rel="mx"><span class="country" style="white-space:nowrap;">



<img src=" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-mx" alt="flag ">Mexico</span>
</td>

         <td> <div class="progress-indicator" style="width: 114px" value="1839" levels="speed" rel="1839">
        <div class="indicator" style="width: 82%; background-color: rgb(0, 173, 173)"></div>
        </div>
         </td>
             <td> 
                 <div class="progress-indicator " style="width: 114px" title="" rel="271" value="271" levels="speed">
                            <div class="indicator" style="width: 95%; background-color: rgb(0, 173, 173)"></div>

        </div>
             </td>

             <td>socks4/5</td>
             <td nowrap="">High +KA</td>

         </tr>

The $('tr').text() output should be a IP address.But ugly result in lxml or pyquery.(Actually $('tr').text() in Jquery can not get the IP too, need some Jquery basic filter,which lxml or pyquery don't support)

For now, I have to use selenium build-in selector to avoid above problem. But I'd like to find a ultimate way to solve such problem. The way using Jquery to parse html.

After some searching, I only find pyv8 can eval javascript in python.

I'd like to use a phantomjs broker to get html, and let jquery to process it.

But pyv8's docs is so few that I can't catch the point.

Any example or other solution is welcome, thank you!

@YOU yes, actually `$('tr').text()` in `Jquery` can not get the IP too, need some `Jquery basic filter`,which lxml or pyquery don't support. — Mithril, Jan 12 '16 at 02:08
Your question does not seem very clear to me. What are you actually doing here, because Python is just a programming language. Are you serving web content with Python? In that case python itself is irrelevant, it's all about your HTML, CSS, and JS doing what it needs to do. Are you using offline/server-side templating to form data? In that case the JS is irrelevant because something like jinja2 is going to do what you need. Is this for testing? Because in that case using a real, headless browser is *exactly* what you want. Can you change your question to be better descriptive? — Mike 'Pomax' Kamermans, Jan 12 '16 at 02:14
jQuery, or any javascript approach to parse something, is a bad idea IMO. Have you tried with scrapy? — Masacroso, Jan 12 '16 at 02:25
@Masacroso please see question carefully,and you can see my another question http://stackoverflow.com/questions/34303054/any-way-to-tell-selenium-dont-execute-js-at-some-point, scrapy can not cope with this, it focus on crawling not parsing. My goal is just to extract content with most easily tool as jquery. — Mithril, Jan 12 '16 at 02:29
if it doesn't work, it's not the easiest tool. If you want to crawl a website, use a crawler. Heck in this case I'd say just use PHP with http://simplehtmldom.sourceforge.net or something instead. Just as little effort to install PHP, guaranteed works, just copy the examples. — Mike 'Pomax' Kamermans, Jan 12 '16 at 06:24
@Mike 'Pomax' Kamermans I'd like to done it just in python, pack as a package would benefit a lot.It does overkill for just one site, but reduce a lot of work in future with much more other sites.Plus, I have a distribute crawler system, this is what I most need now. — Mithril, Jan 12 '16 at 06:44
@Mike 'Pomax' Kamermans And I see the php code, it is just a html parser without a js v8 engine, it act as lxml or pyquery in python, which don't know a element is visiable or not (because the css file do not be excute). — Mithril, Jan 12 '16 at 06:50
@Mike 'Pomax' Kamermans phantomjs does, and it also should be achieved by `pyv8` . — Mithril, Jan 13 '16 at 00:52

score 0 · Answer 1 · edited May 23 '17 at 12:31

0

For now, I build a phantomjs broker which support jquery. The flow is:

python send url and js code(jquery code) to phantomjs
phantomjs(init with jquery injected) get html, and run js code
return js code result
python get the result

more detail can see phantomjs inject jquery.

But it is not running jquery in python in fact, depend on external help.

It's the current best solution I could use. Welcome to post a different approach.

edited May 23 '17 at 12:31

Community

1
1

answered Jan 12 '16 at 02:17

Mithril

12,947
18
102
153

@Mike 'Pomax' Kamermans It's a currentlly optional answer for my question,and I wouldn't accept this until I or someone else post the best way. – Mithril Jan 12 '16 at 06:35
@Mike 'Pomax' Kamermans I have to say I didn't know phantomjs can embed `jquery` before I post the question. This way is acceptable for me(and someones with such question too), and no one can deny it isn't the best one solution for my question now.But I'd like to see a more independent way, that's why I wouldn't accept this. – Mithril Jan 13 '16 at 00:59
@Mike 'Pomax' Kamermans phantomjs can inject local jquery, see the link in this answer. – Mithril Jan 13 '16 at 01:39

How to use real jquery in python to extract data from a web page?

1 Answers1