
I have two separate spiders:

  1. Spider 1 gets the list of URLs from the given HTML pages (a minimal sketch follows this list).

  2. Spider 2 uses the URLs scraped by the previous spider as its start URLs and scrapes those pages.
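
For concreteness, Spider 1 is roughly something like this (the class name, start page, and selector are just placeholders, not my real code):

import scrapy

class UrlCollectorSpider(scrapy.Spider):
    # Spider 1: collect links from the given HTML pages
    name = "url_collector"                          # hypothetical name
    start_urls = ["http://example.com/list.html"]   # placeholder listing page

    def parse(self, response):
        # yield every absolute link on the page; these URLs are what
        # Spider 2 later receives as its start_urls
        for href in response.css("a::attr(href)").extract():
            yield {"url": response.urljoin(href)}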

Now, what I am trying to do is schedule this so that, every hour or so, all the Spider 2 URLs are fired in parallel, at the same time.

I have deployed it on Scrapyd and I am passing the start URL from a Python script to each deployed spider as an argument, like this:

import requests

# one schedule.json request per URL; each request creates a separate Scrapyd job
for url in start_urls:
    r = requests.post("http://localhost:6800/schedule.json",
                      params={
                          'project': 'project',
                          'spider': 'spider',
                          'start_urls': url
                      })

Inside the spider I read this start_urls argument from kwargs and assign it to start_urls.
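
For reference, the spider side looks roughly like this (the class name and the parse logic are only illustrative; the start_urls handling is the part that mirrors what I actually do):

import scrapy

class PageSpider(scrapy.Spider):
    # Spider 2: scrape the page passed in via the start_urls argument
    name = "spider"

    def __init__(self, start_urls=None, *args, **kwargs):
        super(PageSpider, self).__init__(*args, **kwargs)
        # Scrapyd passes spider arguments as plain strings, so each job
        # receives exactly one URL here
        self.start_urls = [start_urls] if start_urls else []

    def parse(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").extract_first()}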

But what I have noticed is that when I pass multiple URLs to the same deployed spider using the for loop, the jobs never run in parallel.

At any point in time only one job is running; the other jobs stay in the pending state (not running).

The scrapyd service settings are at their defaults; I only changed the following two settings:

max_proc    = 100
max_proc_per_cpu = 25
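
For reference, those two settings sit in the [scrapyd] section of a scrapyd config file, roughly like this (the file location varies by install, and poll_interval is only shown here at its default):

# example scrapyd.conf
# (possible locations: ./scrapyd.conf, ~/.scrapyd.conf,
#  /etc/scrapyd/scrapyd.conf or /etc/scrapyd/conf.d/* on Ubuntu packages)
[scrapyd]
max_proc         = 100
max_proc_per_cpu = 25
# how often the pending queue is polled for new jobs (default is 5 seconds)
poll_interval    = 5.0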

How can I achieve something close to real parallelism using Python + Scrapy + Scrapyd?

Or will I have to go with Python's multiprocessing.Pool.apply_async or some other solution?
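
If it matters, the multiprocessing alternative I have in mind would look roughly like this (the spider name and URLs are placeholders, and it assumes the script is run from inside the Scrapy project directory):

import subprocess
from multiprocessing import Pool

# placeholder URLs collected by Spider 1
start_urls = [
    "http://example.com/page1",
    "http://example.com/page2",
]

def crawl(url):
    # each worker launches its own independent "scrapy crawl" process,
    # passing the URL in as the start_urls spider argument
    return subprocess.call(
        ["scrapy", "crawl", "spider", "-a", "start_urls=" + url])

if __name__ == "__main__":
    pool = Pool(processes=4)   # number of crawls to run at the same time
    jobs = [pool.apply_async(crawl, (url,)) for url in start_urls]
    pool.close()
    pool.join()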

MrPandav
  • I had to copy my scrapyd config to /etc/scrapyd/conf.d/ on ubuntu to solve the opposite problem. I wanted only one job to run at a time but more than one was running. – rocktheartsm4l Jul 15 '15 at 17:03
