
http://en.wikipedia.org/wiki/List_of_cities_in_China

I want to extract all the city names from the table on this page.

I am using the following code (for now it extracts only one field); the XPath was copied from Chrome:

from lxml import html
import requests

page = requests.get('http://en.wikipedia.org/wiki/List_of_cities_in_China')
tree = html.fromstring(page.text)

huabeiTree=tree.xpath('//*[@id="mw-content-text"]/table[3]/tbody/tr[1]/td[1]/a/text()')
print(huabeiTree)

Nothing appears.

My ultimate goal is to extract all the cities in the list. How can I achieve this?

william007

1 Answer

from lxml import html
import requests

page = requests.get('http://en.wikipedia.org/wiki/List_of_cities_in_China')
tree = html.fromstring(page.text)

huabeiTree=tree.xpath('//table[@class="wikitable sortable"]')
list_of_cities_table = huabeiTree[0]  # the first matching table is the one we need

# Iterate over the table's <tr> rows. For rows whose first cell is a <td>
# (i.e. not a header row), print the text of the first element in that cell.
for tr in list_of_cities_table:
    if tr[0].tag == 'td':
        print(tr[0][0].text)

It prints a list of 656 cities, from Beijing to Zhuji.
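A likely reason the XPath copied from Chrome returned nothing: browsers insert a <tbody> element into every table when they build the DOM, even if the page source has none, whereas lxml parses the markup as written. The sketch below illustrates this on a small made-up HTML snippet (SAMPLE is an assumption standing in for the real page, not the actual Wikipedia source):

```python
from lxml import html

# Hypothetical stand-in for the page source: note there is no <tbody> in the markup.
SAMPLE = """
<div id="mw-content-text">
  <table class="wikitable sortable">
    <tr><th>City</th><th>Province</th></tr>
    <tr><td><a href="/wiki/Beijing">Beijing</a></td><td>Beijing</td></tr>
    <tr><td><a href="/wiki/Zhuji">Zhuji</a></td><td>Zhejiang</td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# Chrome-style path with /tbody: matches nothing, because lxml does not add <tbody>.
with_tbody = tree.xpath('//table/tbody/tr/td[1]/a/text()')

# Same path without /tbody: matches the link text in the first cell of each row.
without_tbody = tree.xpath('//table/tr/td[1]/a/text()')

print(with_tbody)
print(without_tbody)
```

If the real page source does contain <tbody>, writing //tr instead of /tr after the table step matches rows in both cases.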

P.S. This may not be the most elegant solution. It could be improved with a better XPath expression.
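One possible way to do that is to fold the loop into a single XPath expression. This is only a sketch: extract_city_names and SAMPLE_PAGE are names made up for illustration, and it assumes that the city name is the text of a link in the first cell of each row of the first "wikitable sortable" table, as in the code above.

```python
from lxml import html

def extract_city_names(page_source):
    """Return the link text from the first cell of each data row in the
    first table whose class is exactly "wikitable sortable"."""
    tree = html.fromstring(page_source)
    # (...)[1] picks the first matching table in the whole document;
    # //tr finds its rows whether or not a <tbody> is present;
    # td[1] skips header rows, which contain only <th> cells.
    return tree.xpath(
        '(//table[@class="wikitable sortable"])[1]//tr/td[1]/a/text()'
    )

# Hypothetical sample markup, not the real Wikipedia page.
SAMPLE_PAGE = """
<html><body>
  <table class="wikitable sortable">
    <tr><th>City</th><th>Province</th></tr>
    <tr><td><a href="/wiki/Beijing">Beijing</a></td><td>Beijing</td></tr>
    <tr><td><a href="/wiki/Zhuji">Zhuji</a></td><td>Zhejiang</td></tr>
  </table>
  <table class="wikitable sortable">
    <tr><td><a href="/wiki/Other">Other table</a></td></tr>
  </table>
</body></html>
"""

cities = extract_city_names(SAMPLE_PAGE)
print(cities)
```

With the real page, the same function could be called as extract_city_names(requests.get(url).text). Whether it returns the same 656 names depends on the page's current markup.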

sk11