0

Here is a scraper I created using Python on ScraperWiki:

import lxml.html
import re
import scraperwiki

pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank'  : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
    # DEBUG BEGIN
    if not type(data["arwu_rank"]) is str:
        print type(data["arwu_rank"])
        print data["arwu_rank"]
        print data["university"]
    # DEBUG END
    if "-" in data["arwu_rank"]:
        arwu_rank_bounds  = data["arwu_rank"].split("-")
        data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
    if not type(data["arwu_rank"]) is int:
        data["arwu_rank"] = int(data["arwu_rank"])
    scraperwiki.sqlite.save(unique_keys=['university'], data=data)

It works perfectly except when scraping the final data row of the table (the "York University" line), at which point instead of lines 9 through 11 of the code causing the string "401-500" to be retrieved from the table and assigned to data["arwu_rank"], those lines somehow seem instead to be causing the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but also that that debugging code doesn't go very deep.

I have two questions:

  1. What are my options for debugging scrapers run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? E.g. is there a way to step through?
  2. Can you tell me why the the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?

EDIT 6 May 2013, 20:07h UTC

The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:

import lxml.html
import re
import scraperwiki

pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank'  : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # DEBUG BEGIN
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # DEBUG END
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds  = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)

1 Answers1

2

There's no easy way to debug your scripts on ScraperWiki, unfortunately it just sends your code in its entirety and gets the results back, there's no way to execute the code interactively.

I added a couple more prints to a copy of your code, and it looks like the if check before the bit that assigns data

if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:

doesn't trigger for "York University" so it will be keeping the int value (you set it later on) from the previous time around the loop.

Ross
  • 406
  • 2
  • 4
  • Thanks for the info about debugging. I'm not sure about your other point, though. IIUC, if that conditional is not triggered for the York University line, then `data["university"]` should never be assigned the value "York University"; yet on that last iteration, `data["university"]` *is* assigned the value "York University". –  May 06 '13 at 19:55
  • @sampablokuper This is likely because there is a row at the bottom of the table with the text "Institutions within the same rank range are listed alphabetically." and as you don't reset 'data' it will actually show York University twice, the second time with the pre-computed value form the previous run through that loop. Does that make sense? – Ross May 07 '13 at 07:16
  • Right you are. In particular, that happens due to the incorrect nesting I used in the first version of the code above. Thanks :) –  May 07 '13 at 11:01