I have two separate spiders:
Spider 1 gets a list of URLs from given HTML pages.
Spider 2 uses each URL scraped by the previous spider as a start URL and scrapes those pages.
What I am trying to do is schedule this so that every hour or so all the spider 2 URLs are fired in parallel, at the same time.
I have deployed the project on scrapyd and I pass the start URL from a Python script to the deployed spider as an argument, like this:
    import requests

    for url in start_urls:
        r = requests.post("http://localhost:6800/schedule.json",
                          data={
                              'project': 'project',
                              'spider': 'spider',
                              'start_urls': url,
                          })
Inside the spider I read this argument, start_urls, from kwargs and assign it to start_urls, roughly like this:
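(A simplified sketch of the spider side; the class name and parsing logic are placeholders.)

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'spider'  # the name used in the schedule.json request

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            # scrapyd forwards the 'start_urls' argument from schedule.json
            # to the spider constructor; it arrives as a single URL string
            self.start_urls = [kwargs.get('start_urls')]

        def parse(self, response):
            # actual page scraping happens here
            pass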
What I have noticed is that when I pass multiple URLs to the same deployed spider using the for loop above, the jobs never run in parallel.
At any point in time only one job is running; the other jobs stay in the pending state (not running).
The scrapyd service settings are at their defaults; I only changed the following two settings:
max_proc = 100
max_proc_per_cpu = 25
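(For completeness, these two overrides sit in the [scrapyd] section of my scrapyd.conf; everything else is left at its default.)

    [scrapyd]
    max_proc         = 100
    max_proc_per_cpu = 25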
How can I achieve something close to real parallelism with Python + Scrapy + scrapyd?
Or will I have to go with Python multiprocessing's Pool.apply_async, or some other solution?
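For reference, the multiprocessing fallback I have in mind would look roughly like this (spider name, pool size, and URLs are placeholders), with one CrawlerProcess per worker process:

    from multiprocessing import Pool

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_spider(url):
        # each worker process has its own Twisted reactor, so a blocking
        # CrawlerProcess per process is fine
        process = CrawlerProcess(get_project_settings())
        process.crawl('spider', start_urls=url)
        process.start()

    if __name__ == '__main__':
        start_urls = ['http://example.com/page1', 'http://example.com/page2']
        # maxtasksperchild=1 gives every crawl a fresh process, avoiding
        # the "reactor not restartable" problem when a worker is reused
        pool = Pool(processes=4, maxtasksperchild=1)
        for url in start_urls:
            pool.apply_async(run_spider, args=(url,))
        pool.close()
        pool.join()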