
Hey, so I have about 50 spiders in my project and I'm currently running them via a scrapyd server. I'm running into an issue where some of the resources I use get locked, which makes my spiders fail or run really slowly. I was hoping there was some way to tell scrapyd to only have one spider running at a time and leave the rest in the pending queue. I didn't see a configuration option for this in the docs. Any help would be much appreciated!

Jon Clements
rocktheartsm4l
  • What kind of shared resources do you have? – alecxe Jul 25 '14 at 16:31
  • I have an sqlite file that I write to. Every once in a while I get a "cannot connect" error. Also, I'm using phantomjs and selenium to handle dynamic (javascript) content. Sometimes phantomjs's GhostDriver seems to get blocked due to a race condition. – rocktheartsm4l Jul 25 '14 at 18:18
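As an aside on the sqlite errors mentioned above: sqlite locks the whole database file on writes, so concurrent spiders can hit "database is locked" failures. Raising the busy timeout makes writers wait for the lock instead of failing immediately. A minimal sketch (the file name `items.db` and the table schema are placeholders, not from the original project):

```python
import sqlite3

# Wait up to 30 seconds for a competing writer to release the lock
# instead of raising "database is locked" right away.
conn = sqlite3.connect("items.db", timeout=30)
conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT)")
conn.execute("INSERT INTO items VALUES ('http://example.com')")
conn.commit()
conn.close()
```

This only mitigates contention; as the answer's comments note, a server-based database is the better fix once the prototype phase is over.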

1 Answer


This can be controlled by scrapyd settings. Set max_proc to 1:

max_proc

The maximum number of concurrent Scrapy processes that will be started.
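For example, in scrapyd's configuration file (typically `/etc/scrapyd/scrapyd.conf`, or a `scrapyd.conf` in the scrapyd working directory — the exact path depends on your setup):

```ini
[scrapyd]
# Run at most one Scrapy process at a time; any other
# scheduled spiders stay in the pending queue until it finishes.
max_proc = 1
```

Note that when `max_proc` is left at its default of 0, scrapyd instead starts up to `max_proc_per_cpu` processes per available CPU, so an explicit `max_proc = 1` is needed to force strictly sequential runs.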

alecxe
  • Does `max_proc` keep requests from being made asynchronously? That is why I didn't use it; it was unclear to me if this would be the case. This could be a lack of understanding on my part, so a follow-up question: does scrapy actually spawn new processes or threads to handle requests asynchronously, or is there some kind of twisted framework "magic" making this happen? – rocktheartsm4l Jul 25 '14 at 18:16
  • @rocktheartsm4l requests would be async anyway since there is twisted under the hood. `max_proc` just helps to have a single spider running at a time. This is how I understand it. What kind of resources are shared among the spiders and slowing things down? I think you need to fix that instead of trying to make it run in a blocking mode. – alecxe Jul 25 '14 at 18:18
  • Answered that one above. Thanks for the quick responses. – rocktheartsm4l Jul 25 '14 at 18:21
  • @rocktheartsm4l ok, yeah, first of all, sqlite is really not a good choice here since it locks the whole database on writes. Switch to postgresql or mysql if you need a classic relational database, or to mongodb or redis if you need a NoSQL solution. Also, elaborate the phantomjs problem into a separate question with details. Thanks. – alecxe Jul 25 '14 at 18:21
  • Thanks for the insight into sqlite. Right now my project is a prototype and I'm just using the sqlite file as a dummy database till I hook my project up to the real database next week. I'll only be using the max_proc = 1 till then. I'll make a new question about the phantomjs problem. – rocktheartsm4l Jul 25 '14 at 18:24
  • Don't know if this is still in your knowledge base but here is the follow up question: http://stackoverflow.com/questions/24962520/using-phantomjs-for-dynamic-content-with-scrapy-and-selenium-possible-race-condi – rocktheartsm4l Jul 25 '14 at 18:45
  • @rocktheartsm4l it is not directly in my knowledge base, but the things you are tackling are very much connected to what I'm doing in one of my projects. I will pay attention to it. – alecxe Jul 25 '14 at 18:48
  • @rocktheartsm4l by the way, is phantomjs logging critical? Turning it off can be an option :) – alecxe Jul 25 '14 at 18:49
  • Not critical at all! If I don't find another solution I'll point phantomjs's log_path at /dev/null. And thanks! – rocktheartsm4l Jul 25 '14 at 18:55