
I have been trying to scrape a dynamic field of an HTML page using lxml. The code is pretty simple and is shown below:

from lxml import html
import requests
page = requests.get('http://www.airmilescalculator.com/distance/blr-to-cdg/')
tree = html.fromstring(page.content)
miles = tree.xpath('//input[@class="distanceinput2"]/text()')
print(miles)

The result I get is just an empty list, []. I expected the list to contain a number. However, I am able to scrape static fields of the same page.

Thanks in advance for the help.

Tauseef Hussain

3 Answers

You can't select text nodes from input fields because an input element has no text node; its content is stored in the value attribute.

<input type="text" class="distanceinput2" .. />

To get value from an input field use:

miles = [node.value for node in tree.xpath('//input[@class="distanceinput2"]')]

This returns the value attribute of each matching input.

However, the desired values are computed in the browser, so we need to visit the page and simulate a click to get them. The splinter package is made for that.

from pyvirtualdisplay import Display
from splinter import Browser

# Start a virtual display so the browser window stays hidden.
display = Display(visible=0)
display.start()

url = 'http://www.airmilescalculator.com/distance/blr-to-cdg/'

browser = Browser()
browser.visit(url)
# Simulate the click that makes the page compute the distances.
browser.find_by_id('haemulti')[0].click()

# Read the computed values from the three input fields.
print(browser.find_by_id('totaldistancemiles')[0].value)
print(browser.find_by_id('totaldistancekm')[0].value)
print(browser.find_by_id('nauticalmiles')[0].value)

browser.quit()
display.stop()

pyvirtualdisplay is used to hide the browser.

OUTPUT:

$ python test.py
4868
7834
4230
Kenly

The issue here is that the value in the text box is filled in by JavaScript. When the page loads, the value in the text field is 0. So even if you scrape the page, you won't get the computed value, because the scraped content looks like this:

<input class="distanceinput2" id="totaldistancemiles" name="totaldistancemiles" readonly="readonly" size="5" title="Distance in miles" type="text" value="0"/>
<input class="distanceinput2" id="totaldistancekm" name="totaldistancekm" readonly="readonly" size="5" title="Distance in kilometers" type="text" value="0"/>
<input class="distanceinput2" id="nauticalmiles" name="nauticalmiles" readonly="readonly" size="5" title="Distance in nautical miles" type="text" value="0"/>

So the value shown on the website cannot be obtained by scraping the static HTML alone.
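As a rough offline check of the point above, the sketch below parses the three input elements quoted in this answer and prints each element's text and value attribute. It uses the standard library's ElementTree instead of lxml so that it runs without extra packages; the `<form>` wrapper and the `read_inputs` helper are assumptions added for illustration and are not part of the page or of any library.

```python
import xml.etree.ElementTree as ET

# The three <input> elements as quoted in the answer above. The <form>
# wrapper is an assumption, added only so the snippet has a single root.
STATIC_HTML = """
<form>
<input class="distanceinput2" id="totaldistancemiles" name="totaldistancemiles" readonly="readonly" size="5" title="Distance in miles" type="text" value="0"/>
<input class="distanceinput2" id="totaldistancekm" name="totaldistancekm" readonly="readonly" size="5" title="Distance in kilometers" type="text" value="0"/>
<input class="distanceinput2" id="nauticalmiles" name="nauticalmiles" readonly="readonly" size="5" title="Distance in nautical miles" type="text" value="0"/>
</form>
"""


def read_inputs(markup):
    """Return {id: (text, value)} for each input with class distanceinput2."""
    root = ET.fromstring(markup.strip())
    result = {}
    for node in root.findall(".//input[@class='distanceinput2']"):
        # node.text is the element's text node (None for an empty element);
        # node.get("value") reads the value attribute.
        result[node.get("id")] = (node.text, node.get("value"))
    return result


if __name__ == "__main__":
    for field_id, (text, value) in read_inputs(STATIC_HTML).items():
        print(field_id, "text:", text, "value:", value)
```

Each field has no text node, and its value attribute is the placeholder "0" from the static markup, which is why a plain HTTP fetch never sees the computed distance.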

You could try PhantomJS, which acts as a headless browser. I haven't experimented with it, but it looks like it could work. Here is a link that could help.

Hope that helps!

hkasera
  • Yeah I just figured out that. The data is coming from an ajax request. Any idea of how to go about this? Any other library that I could use? – Tauseef Hussain Feb 04 '16 at 17:11
  • Try phantom JS (https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/) – hkasera Feb 04 '16 at 17:13

As you've already figured out, the distance is dynamically calculated from the results of the XHR call to the Google Maps API. This would not be easy to simulate or repeat with requests alone, since you would need, at the very least, a JavaScript engine like the one a real browser has.

Here is how you can solve it with selenium and the headless PhantomJS browser:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.airmilescalculator.com/distance/blr-to-cdg/")

distance = driver.find_element_by_id("totaldistancemilestext").text
print(distance)

Prints 4868.

alecxe
  • Thank you so much. I'm not sure how to go about the `PhantomJS` part. I get an error while trying to run the code: `selenium.common.exceptions.WebDriverException: Message: 'phantomjs' executable needs to be in PATH.` Is there something that I am missing? – Tauseef Hussain Feb 05 '16 at 09:18
  • @TauseefHussain `PhantomJS` must be somewhere on PATH. Or you can provide the path explicitly, sample: http://stackoverflow.com/a/34840307/771848. – alecxe Feb 05 '16 at 13:05
  • Works like magic :) Thank you so much! I'm going to have to dig further into PhantomJS to learn more. – Tauseef Hussain Feb 05 '16 at 14:43
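
The "executable needs to be in PATH" error from the comments above can be checked before starting selenium. The following is a minimal sketch using only the standard library's `shutil.which`; the `locate_executable` helper name is hypothetical, and the sketch only reports whether the binary is found, it does not start a browser.

```python
import shutil


def locate_executable(name="phantomjs"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = locate_executable()
    if path is None:
        # Same situation as the WebDriverException quoted in the comments.
        print("phantomjs not found on PATH; install it or give its location explicitly")
    else:
        print("phantomjs found at", path)
```

If the lookup returns None, either add the directory containing the binary to PATH or provide the path explicitly, as suggested in the reply above.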