My Scrapy spider is hosted on Scrapinghub and is launched via the run-job API call. The only thing that changes from run to run is the list of start URLs, which may vary from 100 URLs to a couple of thousand. What is the best way to update the start URLs in this scenario? From what I can see, there is no direct option for this in the Scrapinghub API. I am thinking of writing the list of URLs to MySQL and, once it is updated, sending a plain run-job API call (the start URLs would then be generated from the MySQL table). Any comments on this solution, or other options?
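For concreteness, here is a minimal sketch of the MySQL variant I have in mind. The pymysql driver, the connection details, and the table/column names (start_urls, url) are all placeholders, not part of my actual setup:

import pymysql
from scrapy import Request, Spider


class MySpiderFromDb(Spider):
    name = 'my_spider_from_db'  # placeholder name

    def start_requests(self):
        # Credentials/host are placeholders; in practice they would
        # come from settings or environment variables.
        conn = pymysql.connect(host='db.example.com', user='scrapy',
                               password='secret', database='crawler')
        try:
            with conn.cursor() as cur:
                cur.execute('SELECT url FROM start_urls')
                urls = [row[0] for row in cur.fetchall()]
        finally:
            conn.close()
        for url in urls:
            yield Request(url=url)

The appeal of this route is that the run-job call stays tiny, since the URL list never travels through the API itself.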
My current setup is as follows:
import json

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'  # placeholder name

    def __init__(self, startUrls, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.keywords = ['sales', 'advertise', 'contact', 'about', 'policy',
                         'terms', 'feedback', 'support', 'faq']
        # startUrls arrives as a JSON-encoded list in a spider argument
        self.startUrls = json.loads(startUrls)

    def start_requests(self):
        for url in self.startUrls:
            yield Request(url=url)
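For comparison, the alternative is to keep this setup and push the list through the API call itself. A hedged sketch using the python-scrapinghub client, assuming its jobs.run() accepts spider arguments via job_args; the API key, project id, and spider name are placeholders:

import json

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('APIKEY')     # placeholder API key
project = client.get_project(12345)      # placeholder project id

urls = ['http://example.com/a', 'http://example.com/b']
# job_args values reach the spider as string arguments, matching
# the json.loads(startUrls) in __init__ above
project.jobs.run('my_spider', job_args={'startUrls': json.dumps(urls)})

My concern with this variant is whether a single job argument holding a couple of thousand URLs becomes a problem, which is what pushes me toward the MySQL route.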