
Many Python HTML parsers, such as lxml and pyquery, don't know whether an element is `display:none` or not, and their `.text()` function often returns comments or the contents of `<script>` tags, among other bugs.

Take the HTML below, for example:

<tr class="" rel="21261811">
         <td class="leftborder timestamp" rel="1451972762"><span class="updatets ">
14secs</span></td>
         <td><span><style>
.yc_S{display:none}
.qDiU{display:inline}
.VXrU{display:none}
.JoAO{display:inline}
.DEIS{display:none}
.p6YU{display:inline}
</style><span style="display:none">132</span><div style="display:none">132</div><span class="95">189</span><span class="yc_S">243</span><div style="display:none">243</div>.<span class="DEIS">70</span><div style="display:none">70</div><span></span><span class="VXrU">208</span><span></span><span style="display: inline">219</span><span style="display: inline">.</span><span style="display:none">131</span><span class="VXrU">131</span><span style="display: inline">210</span><span class="115">.</span><span style="display: inline">153</span></span></td> 
         <td>
10000</td>
             <td style="text-align:left" rel="mx"><span class="country" style="white-space:nowrap;">



<img src=" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-mx" alt="flag ">Mexico</span>
</td>

         <td> <div class="progress-indicator" style="width: 114px" value="1839" levels="speed" rel="1839">
        <div class="indicator" style="width: 82%; background-color: rgb(0, 173, 173)"></div>
        </div>
         </td>
             <td> 
                 <div class="progress-indicator " style="width: 114px" title="" rel="271" value="271" levels="speed">
                            <div class="indicator" style="width: 95%; background-color: rgb(0, 173, 173)"></div>

        </div>
             </td>

             <td>socks4/5</td>
             <td nowrap="">High +KA</td>

         </tr>

The `$('tr').text()` output should be an IP address, but lxml and pyquery give an ugly result. (Actually, `$('tr').text()` in jQuery cannot get the IP either; it needs some jQuery basic filters, which lxml and pyquery don't support.)
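For this particular markup, the hidden pieces can be filtered out without a browser. The following is only a rough sketch using Python's standard-library `html.parser`, not a general CSS engine. It assumes elements are hidden only through an inline `style="display:none"` or through single-class rules such as `.yc_S{display:none}` in an embedded `<style>` block. The names `hidden_classes`, `VisibleText`, and `visible_text` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Matches simple rules such as ".yc_S{display:none}" (assumption: one class per rule).
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{[^}]*display\s*:\s*none", re.I)
INLINE_NONE = re.compile(r"display\s*:\s*none", re.I)


def hidden_classes(html):
    """Collect class names that an embedded <style> block sets to display:none."""
    css = " ".join(re.findall(r"<style[^>]*>(.*?)</style>", html, re.S | re.I))
    return set(HIDDEN_RULE.findall(css))


class VisibleText(HTMLParser):
    """Collect text that is not inside a hidden element, <style>, or <script>."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.stack = []   # one bool per open element: does it hide its content?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        skip = (tag in ("style", "script")
                or bool(INLINE_NONE.search(attr.get("style") or ""))
                or bool(classes & self.hidden))
        self.stack.append(skip)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text only when no ancestor on the stack is hidden.
        if not any(self.stack):
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText(hidden_classes(html))
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling `visible_text()` on the second `<td>` of the row above should leave only the visible digits and dots. External stylesheets, selector combinators, and styles set by JavaScript are not handled, so this does not replace a real browser.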

For now, I have to use Selenium's built-in selectors to avoid the problem above. But I'd like to find a definitive way to solve this kind of problem: using jQuery to parse the HTML.

After some searching, the only thing I found that can evaluate JavaScript in Python is PyV8.

I'd like to use a PhantomJS broker to fetch the HTML and let jQuery process it.

But PyV8's documentation is so sparse that I can't figure out how to use it.

Any example or other solution is welcome. Thank you!

Mithril
  • just to get IP address? – YOU Jan 11 '16 at 09:55
  • @YOU Yes. Actually, `$('tr').text()` in `jQuery` cannot get the IP either; it needs some `jQuery basic filters`, which lxml and pyquery don't support. – Mithril Jan 12 '16 at 02:08
  • Your question does not seem very clear to me. What are you actually doing here, because Python is just a programming language. Are you serving web content with Python? In that case python itself is irrelevant, it's all about your HTML, CSS, and JS doing what it needs to do. Are you using offline/server-side templating to form data? In that case the JS is irrelevant because something like jinja2 is going to do what you need. Is this for testing? Because in that case using a real, headless browser is *exactly* what you want. Can you change your question to be better descriptive? – Mike 'Pomax' Kamermans Jan 12 '16 at 02:14
  • @Mike'Pomax'Kamermans I am crawling a web site. – Mithril Jan 12 '16 at 02:18
  • jQuery, or any javascript approach to parse something, is a bad idea IMO. Have you tried with scrapy? – Masacroso Jan 12 '16 at 02:25
  • @Masacroso Please read the question carefully; you can also see my other question, http://stackoverflow.com/questions/34303054/any-way-to-tell-selenium-dont-execute-js-at-some-point. Scrapy cannot cope with this; it focuses on crawling, not parsing. My goal is just to extract content with the easiest tool, such as jQuery. – Mithril Jan 12 '16 at 02:29
  • if it doesn't work, it's not the easiest tool. If you want to crawl a website, use a crawler. Heck in this case I'd say just use PHP with http://simplehtmldom.sourceforge.net or something instead. Just as little effort to install PHP, guaranteed works, just copy the examples. – Mike 'Pomax' Kamermans Jan 12 '16 at 06:24
  • @Mike 'Pomax' Kamermans I'd like to do it purely in Python; packaging it up would help a lot. It is overkill for just one site, but it will save a lot of work on many other sites in the future. Plus, I have a distributed crawler system, and this is what I need most right now. – Mithril Jan 12 '16 at 06:44
  • @Mike 'Pomax' Kamermans I looked at the PHP code. It is just an HTML parser without a JS (V8) engine; it acts like lxml or pyquery in Python, which don't know whether an element is visible (because the CSS is not applied). – Mithril Jan 12 '16 at 06:50
  • Then use a headless browser. That's what they're for. – Mike 'Pomax' Kamermans Jan 12 '16 at 16:19
  • @Mike 'Pomax' Kamermans PhantomJS does that, and it should also be achievable with `pyv8`. – Mithril Jan 13 '16 at 00:52

1 Answer

For now, I have built a PhantomJS broker that supports jQuery. The flow is:

  1. Python sends the URL and the JS code (jQuery code) to PhantomJS.
  2. PhantomJS (initialized with jQuery injected) fetches the HTML and runs the JS code.
  3. PhantomJS returns the result of the JS code.
  4. Python receives the result.
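The four steps above might look roughly like this on the Python side. This is a minimal sketch, not the actual broker: it assumes the `phantomjs` executable is on `PATH`, that a local copy of jQuery exists at the hypothetical path `jquery.min.js`, and that the jQuery snippet returns a JSON-serializable value. The helper names `build_script` and `run_in_phantom` are made up for illustration.

```python
import json
import os
import subprocess
import tempfile

# PhantomJS script template; the placeholders are filled with JSON-encoded
# strings, which are also valid JavaScript string literals.
PHANTOM_TEMPLATE = """
var page = require('webpage').create();
page.open(%(url)s, function (status) {
    if (status !== 'success') {
        console.log(JSON.stringify({error: status}));
        phantom.exit(1);
        return;
    }
    page.injectJs(%(jquery)s);
    var result = page.evaluate(function (code) { return eval(code); }, %(code)s);
    console.log(JSON.stringify({result: result}));
    phantom.exit(0);
});
"""


def build_script(url, js_code, jquery_path="jquery.min.js"):
    """Step 1: pack the URL and the jQuery code into a PhantomJS script."""
    return PHANTOM_TEMPLATE % {
        "url": json.dumps(url),
        "jquery": json.dumps(os.path.abspath(jquery_path)),
        "code": json.dumps(js_code),
    }


def run_in_phantom(url, js_code, jquery_path="jquery.min.js", timeout=60):
    """Steps 2-4: let PhantomJS load the page, run the code, and read the result."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(build_script(url, js_code, jquery_path))
        script_path = f.name
    try:
        # Raises CalledProcessError if PhantomJS exits with a non-zero status.
        out = subprocess.check_output(["phantomjs", script_path], timeout=timeout)
    finally:
        os.remove(script_path)
    last_line = out.decode("utf-8").strip().splitlines()[-1]
    return json.loads(last_line)
```

A call such as `run_in_phantom(url, "$('tr').text()")` would then hand back a dict with a `result` key. Starting a new PhantomJS process per request is slow, so a long-running broker process is probably the better design for a distributed crawler.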

For more detail, see "phantomjs inject jquery".

But this is not actually running jQuery in Python; it depends on an external tool.

It's the best solution I can use at the moment. Other approaches are welcome.

Mithril
  • @Mike 'Pomax' Kamermans This is currently a provisional answer to my question, and I won't accept it until I or someone else posts the best way. – Mithril Jan 12 '16 at 06:35
  • @Mike 'Pomax' Kamermans I have to say I didn't know PhantomJS could embed `jquery` before I posted the question. This way is acceptable for me (and for others with the same question), and no one can deny that it is the best solution for my question right now. But I'd like to see a more independent way; that's why I won't accept this. – Mithril Jan 13 '16 at 00:59
  • @Mike 'Pomax' Kamermans phantomjs can inject local jquery, see the link in this answer. – Mithril Jan 13 '16 at 01:39