2
import requests
from lxml import html

page = requests.get(url="http://www.cia.gov/library/publications/the-world-factbook/geos/ch.html")
tree = html.fromstring(page.content)

bordering = tree.xpath('//*[@id="wfb_data"]/table/tr[4]/td/ul[3]/li[4]/div[17]/span[2]/text()')
print bordering

I copied the XPath from Chrome's developer tools, but the "bordering" variable still comes back empty. I'm at a loss for what could be wrong.

alecxe

2 Answers

3

First of all, you need to use https and not http:

https://www.cia.gov/library/publications/the-world-factbook/geos/ch.html

Also, there is a simpler and less brittle way to get the bordering data: find the span whose text starts with "border countries" and take the text of its next sibling span:

bordering = tree.xpath('//*[@id="wfb_data"]//span[starts-with(., "border countries")]/following-sibling::span')[0]
print(bordering.text_content())

Prints:

Afghanistan 91 km, Bhutan 477 km, Burma 2,129 km, India 2,659 km, Kazakhstan 1,765 km, North Korea 1,352 km, Kyrgyzstan 1,063 km, Laos 475 km, Mongolia 4,630 km, Nepal 1,389 km, Pakistan 438 km, Russia (northeast) 4,133 km, Russia (northwest) 46 km, Tajikistan 477 km, Vietnam 1,297 km
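
To try the label-then-sibling lookup without hitting the network, here is a minimal sketch that runs the same XPath against an inline HTML fragment. The SAMPLE markup, the placeholder country names, and the border_countries helper are assumptions made for illustration; the real page's structure may differ.

```python
from lxml import html

# Hypothetical fragment that only mimics the label/value span pair;
# it is not copied from the real page, and the names are placeholders.
SAMPLE = """
<div id="wfb_data">
  <div>
    <span>border countries:</span>
    <span>Country A 10 km, Country B 20 km</span>
  </div>
</div>
"""

def border_countries(tree):
    """Return the text of the span following the 'border countries' label, or None."""
    spans = tree.xpath(
        '//*[@id="wfb_data"]'
        '//span[starts-with(., "border countries")]'
        '/following-sibling::span'
    )
    # Guard against an empty result instead of indexing [0] blindly.
    return spans[0].text_content().strip() if spans else None

tree = html.fromstring(SAMPLE)
result = border_countries(tree)
print(result)
```

Returning None when nothing matches makes an empty result easy to tell apart from a parsing crash, which helps when the page layout changes.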
alecxe
0

Try sending a browser-like User-Agent header with the request:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
# verify=False disables TLS certificate verification; use it only while debugging
page = requests.get(url, headers=headers, timeout=5, verify=False)

Please let me know if this works.

Thanks.