
I'm using ScraperWiki to pull in links from the london-gazette.co.uk site. How would I edit the code so that I can paste in a number of separate search URLs at the bottom which are all collated into the same datastore?

At the moment I can just paste in the new URL, hit run, and the new data is added on to the back of the old data, but I was wondering if there's a way to speed things up and get the scraper to work on several URLs at once? I would be changing the 'notice code' part of the URLs: issues/2013-01-15;2013-01-15/all=NoticeCode%3a2441/start=1

Sorry - new to Stack Overflow and my coding knowledge is pretty much non-existent, but the code is here: https://scraperwiki.com/scrapers/links_1/edit/

  • Is my answer what you wanted or are you looking for something else? – Suzana Apr 21 '13 at 00:01
  • Sorry, had somehow turned email notifications off. Thanks, but it didn't work. It worked for your example scraper, but when I tried to adapt it to change the notice code section of the URL I got nowhere. – Henry Taylor Apr 21 '13 at 08:54
  • Can you give an example list of URLs you want to scrape? – Suzana Apr 23 '13 at 15:41
  • Sure, the start of every URL is http://www.london-gazette.co.uk and I'd be looking at scraping URLs that vary by notice code and date, like so: issues/2013-01-15;2013-01-15/all=NoticeCode%3a2441/start=1 / issues/2013-01-15;2013-01-15/all=NoticeCode%3a2453/start=1 / issues/2013-01-15;2013-01-15/all=NoticeCode%3a2462/start=1 / issues/2012-02-10;2013-02-20/all=NoticeCode%3a2441/start=1 – Henry Taylor Apr 24 '13 at 09:57
  • These URLs work well with the code from my answer. – Suzana Apr 24 '13 at 12:57

1 Answer


The scraper you linked to seems to be empty, but I had a look at the original scraper by Rebecca Ratcliffe. If yours is the same, you only have to put your URLs into a list and loop through them with a for-loop:

import urlparse

# All search paths to run; edit the notice codes and date ranges here.
urls = ['/issues/2013-01-15;2013-01-15/all=NoticeCode%3a2441/start=1',
        '/issues/2013-01-15;2013-01-15/all=NoticeCode%3a2453/start=1',
        '/issues/2013-01-15;2013-01-15/all=NoticeCode%3a2462/start=1',
        '/issues/2012-02-10;2013-02-20/all=NoticeCode%3a2441/start=1']

base_url = 'http://www.london-gazette.co.uk'
for u in urls:
    # Build the full URL and run the existing scraping routine on it;
    # each run appends its results to the same datastore.
    starting_url = urlparse.urljoin(base_url, u)
    scrape_and_look_for_next_link(starting_url)

Just have a look at this scraper that I copied and adapted accordingly.
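In case your scraper differs from the original, here is a minimal sketch of what a scrape_and_look_for_next_link routine could look like with the classic ScraperWiki Python library. The CSS selectors ('a.notice-link', 'a.next') and the column names are placeholders I made up, not taken from Rebecca Ratcliffe's scraper, so inspect the london-gazette.co.uk markup and adjust them to match:

import scraperwiki
import urlparse
import lxml.html

def scrape_and_look_for_next_link(url):
    # Fetch and parse the search results page.
    root = lxml.html.fromstring(scraperwiki.scrape(url))

    # Placeholder selector: change to match the actual result links.
    for link in root.cssselect('a.notice-link'):
        record = {
            'title': link.text_content().strip(),
            'url': urlparse.urljoin(url, link.get('href')),
        }
        # unique_keys makes repeat runs update existing rows instead
        # of duplicating them, so every search lands in one datastore.
        scraperwiki.sqlite.save(unique_keys=['url'], data=record)

    # Placeholder selector for the pagination link, if there is one.
    next_links = root.cssselect('a.next')
    if next_links:
        scrape_and_look_for_next_link(
            urlparse.urljoin(url, next_links[0].get('href')))

Because the save uses a unique key, you can keep adding new search URLs to the list and re-running the scraper without worrying about duplicate rows.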

Suzana