
I'm using ScraperWiki to pull in links from the london-gazette.co.uk site. How would I edit the code so that I can paste in a number of separate search URLs at the bottom which are all collated into the same datastore?

At the moment I can just paste in the new URL, hit run, and the new data is added on to the back of the old data, but I was wondering if there's a way to speed things up and get the scraper to work on several URLs at once? I would be changing the 'notice code' part of the URLs: issues/2013-01-15;2013-01-15/all=NoticeCode%3a2441/start=1

Sorry - new to Stack Overflow and my coding knowledge is pretty much non-existent, but the code is here: https://scraperwiki.com/scrapers/links_1/edit/

  • Is my answer what you wanted or are you looking for something else? – Suzana Apr 21 '13 at 00:01
  • Sorry, had somehow turned email notifications off. Thanks, but it didn't work. It worked for your example scraper, but when I tried to adapt it to change the notice code section of the URL I got nowhere. – Henry Taylor Apr 21 '13 at 08:54
  • Can you give an example list of URLs you want to scrape? – Suzana Apr 23 '13 at 15:41
  • Sure, the start of every URL is http://www.london-gazette.co.uk and I'd be looking at scraping URLs that vary by notice code and date, like so: issues/2013-01-15;2013-01-15/all=NoticeCode%3a2441/start=1 / issues/2013-01-15;2013-01-15/all=NoticeCode%3a2453/start=1 / issues/2013-01-15;2013-01-15/all=NoticeCode%3a2462/start=1 / issues/2012-02-10;2013-02-20/all=NoticeCode%3a2441/start=1 – Henry Taylor Apr 24 '13 at 09:57
  • These URLs work well with the code from my answer. – Suzana Apr 24 '13 at 12:57

1 Answer


The scraper you linked to seems to be empty, but I had a look at the original scraper by Rebecca Ratcliffe. If yours is the same, you only have to put your URLs into a list and loop through them with a for-loop:

import urlparse

# All search paths to run; edit the notice codes and date ranges here.
urls = ['/issues/2013-01-15;2013-01-15/all=NoticeCode%3a2441/start=1',
        '/issues/2013-01-15;2013-01-15/all=NoticeCode%3a2453/start=1',
        '/issues/2013-01-15;2013-01-15/all=NoticeCode%3a2462/start=1',
        '/issues/2012-02-10;2013-02-20/all=NoticeCode%3a2441/start=1']

base_url = 'http://www.london-gazette.co.uk'
for u in urls:
    # Build the full URL and run the existing scraping routine on it;
    # each run appends its results to the same datastore.
    starting_url = urlparse.urljoin(base_url, u)
    scrape_and_look_for_next_link(starting_url)

Just have a look at this scraper that I copied and adapted accordingly.
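In case your scraper differs from the original, here is a minimal sketch of what a scrape_and_look_for_next_link routine could look like with the classic ScraperWiki Python library. The CSS selectors ('a.notice-link', 'a.next') and the column names are placeholders I made up, not taken from Rebecca Ratcliffe's scraper, so inspect the london-gazette.co.uk markup and adjust them to match:

import scraperwiki
import urlparse
import lxml.html

def scrape_and_look_for_next_link(url):
    # Fetch and parse the search results page.
    root = lxml.html.fromstring(scraperwiki.scrape(url))

    # Placeholder selector: change to match the actual result links.
    for link in root.cssselect('a.notice-link'):
        record = {
            'title': link.text_content().strip(),
            'url': urlparse.urljoin(url, link.get('href')),
        }
        # unique_keys makes repeat runs update existing rows instead
        # of duplicating them, so every search lands in one datastore.
        scraperwiki.sqlite.save(unique_keys=['url'], data=record)

    # Placeholder selector for the pagination link, if there is one.
    next_links = root.cssselect('a.next')
    if next_links:
        scrape_and_look_for_next_link(
            urlparse.urljoin(url, next_links[0].get('href')))

Because the save uses a unique key, you can keep adding new search URLs to the list and re-running the scraper without worrying about duplicate rows.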

Suzana