
Here is a ScraperWiki scraper written in Python:

import lxml.html
import scraperwiki
from unidecode import unidecode

html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
root = lxml.html.fromstring(html)
for tr in root.cssselect("table.ranking tr"):
    if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
        university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
        if 'cole' in university:
            print university

It produces the following output:

Ecole Polytechnique Federale De Lausanne
Ecole Normale Superieure
Acole Polytechnique
Ecole Normale Superieure De Lyon

My question: what is causing the initial character on the third output line to be rendered as "A" rather than as "E", and how can I stop this from happening?

  • There's a difference between the ones coming out as Ecole and the one coming out as Acole. The Ecole ones are actually written with an HTML entity for the É, while the one standing out is a literal `École Polytechnique`, i.e. not an HTML entity. The break could be occurring in either `lxml` or `unidecode`. Also make sure your terminal supports the right encoding. – soulseekah May 07 '13 at 19:37
  • Right you are. Oddly, the Firefox inspector didn't show that difference. Now to try to figure out the solution. Incidentally, if you want to turn your comment into an answer, I'll gladly upvote it (and if it answers the second part of my question, then of course I'll also gladly mark it solved). –  May 07 '13 at 19:45

1 Answer

Based on soulseekah's helpful comment above and on the lxml docs, the following solution works:

import lxml.html
import scraperwiki
from unidecode import unidecode
from BeautifulSoup import UnicodeDammit

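def decode_html(html_string):
    # Let UnicodeDammit detect the encoding and decode the page to unicode.
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        # UnicodeDecodeError requires five arguments, so raise ValueError instead.
        raise ValueError(
            "Failed to detect encoding, tried [%s]"
            % ', '.join(converted.triedEncodings))
    return converted.unicode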

html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
root = lxml.html.fromstring(decode_html(html))
for tr in root.cssselect("table.ranking tr"):
    if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
        university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
        if 'cole' in university:
            print university
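
As for why the "A" appeared in the first place, my best guess (an assumption, I have not confirmed it against the page itself) is that the literal É arrived as two UTF-8 bytes and was read as Latin-1 before unidecode saw it. The sketch below reproduces that effect with the standard library only; `to_ascii` is a hypothetical stand-in for unidecode, not its real algorithm.

```python
# -*- coding: utf-8 -*-
import unicodedata

def to_ascii(text):
    # Rough stand-in for unidecode (an assumption, not its real algorithm):
    # split accented letters into base letter + combining mark, keep ASCII only.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)

# A literal "É" encoded as UTF-8 is the two bytes C3 89.
raw = u"\u00c9cole Polytechnique".encode("utf-8")

# Reading those bytes as Latin-1 yields "Ã" plus a control character.
misread = raw.decode("latin-1")
# Reading them as UTF-8 yields the intended "É".
correct = raw.decode("utf-8")

print(to_ascii(misread))  # Acole Polytechnique
print(to_ascii(correct))  # Ecole Polytechnique
```

The names written as HTML entities are pure ASCII in the source, so they survive any encoding guess, which would explain why only the one literal É was affected.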