
My Scrapy spider is hosted on Scrapinghub and is managed via the run-spider API call. The only thing that changes in the spider from call to call is the list of start URLs, which may vary from 100 URLs to a couple of thousand. What is the best way to update the start URLs in this scenario? From what I can see, there is no direct option for this in the SH API. I am thinking of writing the list of URLs to MySQL and, once it is updated, sending a simple Run Job API call (the start URLs would be generated from the MySQL table). Any comments on this solution, or other options?
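
For illustration, here is a minimal sketch of what I have in mind, assuming a `pymysql` connection and a hypothetical `start_urls` table with a single `url` column (table name, credentials, and spider name are placeholders):

    import pymysql
    import scrapy


    class MySpider(scrapy.Spider):
        name = 'myspider'  # placeholder

        def start_requests(self):
            # Hypothetical table: start_urls(url VARCHAR); adjust to your schema
            conn = pymysql.connect(host='localhost', user='user',
                                   password='secret', database='crawler')
            try:
                with conn.cursor() as cursor:
                    cursor.execute('SELECT url FROM start_urls')
                    for (url,) in cursor.fetchall():
                        yield scrapy.Request(url=url)
            finally:
                conn.close()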

My current setup is as follows.

    import json

    import scrapy
    from scrapy import Request


    class MySpider(scrapy.Spider):
        name = 'myspider'  # name omitted in the original snippet

        def __init__(self, startUrls, *args, **kwargs):
            self.keywords = ['sales', 'advertise', 'contact', 'about', 'policy',
                             'terms', 'feedback', 'support', 'faq']
            # startUrls arrives as a JSON-encoded string; decode it to a list
            self.startUrls = json.loads(startUrls)
            super(MySpider, self).__init__(*args, **kwargs)

        def start_requests(self):
            for url in self.startUrls:
                yield Request(url=url)

– Billy Jhon

1 Answer


You can pass parameters to a Scrapy spider and read them inside the spider.

Send the list of URLs encoded as a JSON string, decode it in the spider, and then fire the requests.

    import json

    import scrapy
    from scrapy import Request


    class MySpider(scrapy.Spider):
        name = 'myspider'  # use your spider's actual name

        def __init__(self, startUrls, *args, **kwargs):
            # startUrls is passed in as a JSON-encoded string; decode it to a list
            self.startUrls = json.loads(startUrls)
            super(MySpider, self).__init__(*args, **kwargs)

        def start_requests(self):
            for url in self.startUrls:
                # the '...' in the original stood for extra kwargs
                # (callback, meta, etc.), not literal Python
                yield Request(url=url)

And here is how you send this parameter to your spider:

    curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER -d startUrls="JSON_ARRAY_OF_LINKS_HERE"
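
If you would rather trigger the run from Python than from curl, a rough equivalent of the call above, assuming the `requests` library (the project ID, spider name, API key, and URLs are placeholders), would be:

    import json

    import requests

    urls = ['http://example.com/page1', 'http://example.com/page2']
    response = requests.post(
        'https://app.scrapinghub.com/api/run.json',
        auth=('APIKEY', ''),  # same credentials as `curl -u APIKEY:`
        data={
            'project': 'PROJECT',
            'spider': 'SPIDER',
            # JSON-encode the list, as the spider expects
            'startUrls': json.dumps(urls),
        },
    )
    print(response.json())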

Your scrapinghub.yml file should look like this:

    projects:
      default: 160868

– Umair Ayub
  • Thank you. This would work perfectly for a list of 10 URLs. But how do I deal with 1000 URLs? – Billy Jhon Nov 01 '17 at 12:15
  • What would the keys in the URLs array be? – Billy Jhon Nov 01 '17 at 12:29
  • @BillyJhon just create a list/array of URLs in whatever programming language you are using, and then encode it as JSON .. the encoded JSON will look like this ... http://www.jsoneditoronline.org/?id=256353f25851ab72ef689ece443dd071 ... in PHP, create an array of URLs, and then do `json_encode($array)` ... – Umair Ayub Nov 01 '17 at 12:32
  • Got it. Thanks. Marked your answer as the accepted one. – Billy Jhon Nov 01 '17 at 12:34
  • Hey, I am getting this error when running shub deploy: Error: Deploy failed (400): project: non_field_errors. I have added the code to the original post. – Billy Jhon Nov 05 '17 at 10:01
  • @BillyJhon Show me your `scrapinghub.yml` and `requirements.txt` file – Umair Ayub Nov 05 '17 at 10:08
  • project: 252665. And there is no requirements.txt for this one. – Billy Jhon Nov 05 '17 at 10:14
  • @BillyJhon See my edited answer; that is how your scrapinghub.yml file should look – Umair Ayub Nov 05 '17 at 10:16
  • Hmm, not sure what's going on ... please try removing the `project.egg-info/` & `build/` folders – Umair Ayub Nov 05 '17 at 10:21
  • Did it twice before every change. – Billy Jhon Nov 05 '17 at 10:22
  • I don't have further knowledge .. try `shub logout` and then `shub login` again, and then deploy ... if that doesn't work either, try updating your `shub` with `pip install shub --upgrade` – Umair Ayub Nov 05 '17 at 10:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/158252/discussion-between-billy-jhon-and-umair). – Billy Jhon Nov 05 '17 at 10:25
  • Just for reference: `yield Request(url=url ... )` returned `SyntaxError: invalid syntax`. I just removed the 3 dots and got it working. – Linkmichiel May 01 '20 at 08:01
  • @Linkmichiel the dots were just to indicate that more code follows; they were not part of the Python code – Umair Ayub May 01 '20 at 10:08