Extracting all cities in Wikipedia

Question

http://en.wikipedia.org/wiki/List_of_cities_in_China

I want to extract all the city names from the list on that page.

I use the following code (to extract only one field for now); the XPath was copied from Chrome:

from lxml import html
import requests

page = requests.get('http://en.wikipedia.org/wiki/List_of_cities_in_China')
tree = html.fromstring(page.text)

huabeiTree=tree.xpath('//*[@id="mw-content-text"]/table[3]/tbody/tr[1]/td[1]/a/text()')
print huabeiTree

Nothing appears.

My ultimate goal is to extract all the cities in the list. How can I achieve this?

What is your goal? If you want to get all the cities in China, there is an easier way to do that. - user3378649, Oct 30 '14 at 07:26

score 1 · Accepted Answer · answered Oct 30 '14 at 08:43

from lxml import html
import requests

page = requests.get('http://en.wikipedia.org/wiki/List_of_cities_in_China')
tree = html.fromstring(page.text)

huabeiTree=tree.xpath('//table[@class="wikitable sortable"]')
list_of_cities_table = huabeiTree[0] # table[0] is what we need

# Iterate over the table's <tr> rows. For data rows (first cell is a <td>),
# print the text of that cell's first child element, which is the city link.
for tr in list_of_cities_table:
    if tr[0].tag == 'td':
        print(tr[0][0].text)

It prints a list of 656 cities, from Beijing to Zhuji.

P.S. This may not be the most elegant approach; it could probably be improved with a better XPath expression.
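
As a rough sketch of what such an XPath expression could look like, the loop can be folded into a single query. This assumes the city list is the first table with class "wikitable sortable" and that each city name is the first link in the first <td> of its row, which is slightly narrower than the tr[0][0] lookup above. The helper name extract_city_names and the small sample markup are made up for illustration. Using //tr instead of /tbody/tr also sidesteps a likely cause of the empty result in the question: Chrome shows <tbody> elements that the browser adds while parsing, and they may not exist in the HTML that requests downloads.

```python
from lxml import html


def extract_city_names(page_source):
    """Return the text of the first link in the first <td> of each table row.

    Hypothetical helper; the table selection is an assumption about the page.
    """
    tree = html.fromstring(page_source)
    # (…)[1] selects the first matching table in the document.
    # //tr matches rows whether or not a <tbody> wrapper is present.
    # Header rows contain only <th> cells, so td[1] skips them.
    xpath = '(//table[@class="wikitable sortable"])[1]//tr/td[1]/a[1]/text()'
    return [name.strip() for name in tree.xpath(xpath)]


# Tiny stand-in for the real page, so the sketch runs without network access.
sample = """
<html><body><div id="mw-content-text">
<table class="wikitable sortable">
<tr><th>City</th><th>Province</th></tr>
<tr><td><a href="/wiki/Beijing">Beijing</a></td><td>Beijing</td></tr>
<tr><td><a href="/wiki/Zhuji">Zhuji</a></td><td><a href="/wiki/Zhejiang">Zhejiang</a></td></tr>
</table>
</div></body></html>
"""

print(extract_city_names(sample))
```

For the live page, the same function could be called as extract_city_names(page.text), with page fetched by requests.get as in the code above. Whether it returns the same 656 names depends on the page's current markup.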
