Scraping dynamic html fields with lxml

Question

I have been trying to scrape a dynamic field of an HTML page using lxml. The code is pretty simple and is below:

from lxml import html
import requests
page = requests.get('http://www.airmilescalculator.com/distance/blr-to-cdg/')
tree = html.fromstring(page.content)
miles = tree.xpath('//input[@class="distanceinput2"]/text()')
print(miles)

The result I get is just an empty list, []. I expected a list containing a number. However, I am able to scrape static fields of the same page.

Thanks in advance for the help.

Kenly · Answer 1 · 2016-02-04T21:29:59.837

You can't select text nodes from input fields because an input element has no text node; its content is stored in the value attribute.

<input type="text" class="distanceinput2" .. />

To get value from an input field use:

miles = [node.value for node in tree.xpath('//input[@class="distanceinput2"]')]

and you should get them.
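
To make the difference between the text node and the value attribute concrete, here is a minimal sketch that runs against a static snippet instead of the live page. The snippet is an assumption modeled on the page's input fields, and the import falls back to the standard library's ElementTree when lxml is not installed, since both expose the `.text`, `.get()` and `findall()` calls used here.

```python
# Prefer lxml when available; xml.etree.ElementTree offers the same calls used below.
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical static markup modeled on the page's inputs.
# The value is "0" because no JavaScript has run on it.
SNIPPET = """
<form>
  <input class="distanceinput2" id="totaldistancemiles" type="text" value="0"/>
  <input class="distanceinput2" id="totaldistancekm" type="text" value="0"/>
</form>
""".strip()

root = ET.fromstring(SNIPPET)
inputs = root.findall(".//input[@class='distanceinput2']")

# An <input> element has no text content, so .text is None for each node.
texts = [node.text for node in inputs]

# The data lives in the value attribute instead.
values = [node.get("value") for node in inputs]

print(texts)   # [None, None]
print(values)  # ['0', '0']
```

Because there is no text content, an XPath ending in `/text()` has nothing to select and returns an empty list, while reading the attribute returns whatever the server put in the HTML.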

The desired values are computed, so we need to visit the page and simulate a click to get them. The splinter package is made for that.

from pyvirtualdisplay import Display
display = Display(visible=0)
display.start()

from splinter import Browser

url = 'http://www.airmilescalculator.com/distance/blr-to-cdg/'

browser = Browser()
browser.visit(url)
browser.find_by_id('haemulti')[0].click()

print(browser.find_by_id('totaldistancemiles')[0].value)
print(browser.find_by_id('totaldistancekm')[0].value)
print(browser.find_by_id('nauticalmiles')[0].value)

browser.quit()

display.stop()

pyvirtualdisplay is used to hide the browser.

OUTPUT:

$python test.py 
4868
7834
4230

As mentioned in the above answer, the data is coming from an Ajax request. Is there a way I could scrape Ajax-loaded fields? – Tauseef Hussain, Feb 04 '16 at 17:12

score 2 · Answer 2 · edited Feb 05 '16 at 04:02

The issue here is that the value in the textbox is added by JavaScript. When the page loads, the value in the text field is 0. So even if you scrape the page, you won't get the value, because the scraped content looks like this:

<input class="distanceinput2" id="totaldistancemiles" name="totaldistancemiles" readonly="readonly" size="5" title="Distance in miles" type="text" value="0"/>
<input class="distanceinput2" id="totaldistancekm" name="totaldistancekm" readonly="readonly" size="5" title="Distance in kilometers" type="text" value="0"/>
<input class="distanceinput2" id="nauticalmiles" name="nauticalmiles" readonly="readonly" size="5" title="Distance in nautical miles" type="text" value="0"/>

So, if you want the value shown on the website, you cannot get it by scraping the static HTML alone.

You could try PhantomJS, which acts as a headless browser. I haven't experimented with it, but it looks like there is a chance. Here is a link that could help.

Hope that helps!

Yeah, I just figured that out. The data is coming from an Ajax request. Any idea how to go about this? Is there another library I could use? – Tauseef Hussain, Feb 04 '16 at 17:11
Try PhantomJS (https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/) – hkasera, Feb 04 '16 at 17:13

score 2 · Accepted Answer · answered Feb 04 '16 at 18:55

As you've already figured out, the distance is dynamically calculated from the results of the XHR call to the Google Maps API. This would not be easy to simulate or repeat with requests alone, since you would need, at the very least, a JavaScript engine like the one a real browser has.

Here is how you can solve it with selenium and the headless PhantomJS browser:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.airmilescalculator.com/distance/blr-to-cdg/")

distance = driver.find_element_by_id("totaldistancemilestext").text
print(distance)

Prints 4868.

alecxe

Thank you so much. I'm not sure how to go about the `PhantomJS` part. I get an error while trying to run the code: `selenium.common.exceptions.WebDriverException: Message: 'phantomjs' executable needs to be in PATH.` Is there something that I am missing? – Tauseef Hussain Feb 05 '16 at 09:18
@TauseefHussain `PhantomJS` must be somewhere on PATH. Or you can provide the path explicitly, sample: http://stackoverflow.com/a/34840307/771848. – alecxe Feb 05 '16 at 13:05
Works like magic :) Thank you so much! I'm gonna have to dig further into PhantomJS to learn more. – Tauseef Hussain Feb 05 '16 at 14:43
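
For the `'phantomjs' executable needs to be in PATH` error discussed in the comments above, a quick check with the standard library can confirm whether PATH is really the problem before the driver is started. This is only a sketch: `find_driver_binary` is a hypothetical helper name, and it assumes Python 3.3 or later for `shutil.which`.

```python
import shutil


def find_driver_binary(name="phantomjs"):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


path = find_driver_binary()
if path is None:
    # Same condition that makes selenium raise the WebDriverException above.
    print("phantomjs is not on PATH; install it or pass its location explicitly")
else:
    print("phantomjs found at", path)
```

If the binary lives outside PATH, Selenium releases that still shipped the PhantomJS driver accepted its location through the `executable_path` argument of `webdriver.PhantomJS`, which is the approach the linked answer describes.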
