
I am using Scrapy and it is great! It's so fast to build a crawler. But as the number of web sites increases I need to create new spiders, and these web sites are all of the same type: all the spiders use the same items, pipelines and parsing process.

The contents of the project directory:

test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py

To reduce source code redundancy, mybasespider.py contains a base spider MyBaseSpider; 95% of the source code is in it, and all other spiders inherit from it. If a spider needs something special, it overrides some class methods; normally only a few lines of source code are needed to create a new spider.

All common settings are placed in settings.py, and one spider's special settings go in [spider name]_settings.py, such as:

The special settings of spider1 in spider1_settings.py:

from settings import *

LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'
START_URLS = [
    'http://test1.com/',
]

The special settings of spider2 in spider2_settings.py:

from settings import *

LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'
START_URLS = [
    'http://test2.com/',
]

Scrapy uses LOG_FILE, LOG_LEVEL and JOBDIR before launching a spider.

All URLs in START_URLS are filled into MyBaseSpider.start_urls; different spiders have different contents, but the name START_URLS used in the base spider MyBaseSpider doesn't change.
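
For illustration, filling start_urls from the settings can be done roughly like this (this is only a simplified sketch, not the actual mybasespider.py; from_crawler and settings.getlist() are standard Scrapy APIs):

import scrapy

class MyBaseSpider(scrapy.Spider):
    # Shared parsing logic lives here; subclasses override only what differs.

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MyBaseSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Copy the per-project START_URLS setting into the spider instance.
        spider.start_urls = crawler.settings.getlist('START_URLS')
        return spider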

The contents of scrapy.cfg:

[settings]
default = test.settings
spider1 = test.spider1_settings
spider2 = test.spider2_settings

[deploy]
url = http://localhost:6800/
project = test

To run a spider, such as spider1:

  1. export SCRAPY_PROJECT=spider1

  2. scrapy crawl spider1

But this way can't be used to run spiders in scrapyd: the scrapyd-deploy command always uses the 'default' project name from the 'settings' section of scrapy.cfg to build an egg file and deploy it to scrapyd.
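
(For what it's worth, scrapyd's schedule.json endpoint does accept setting=... overrides when scheduling a run, so per-spider values could in principle be passed at schedule time; the values below are only illustrative:)

curl http://localhost:6800/schedule.json -d project=test -d spider=spider1 -d setting=JOBDIR=spider1-job -d setting=LOG_FILE=spider1.log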

Have several questions:

  1. Is this the right way to use multiple spiders in one project if I don't want to create a project per spider? Are there any better ways?

  2. How can I separate a spider's special settings as above so that they can run in scrapyd while still reducing source code redundancy?

  3. If all spiders use the same JOBDIR, is it safe to run all the spiders concurrently? Would the persistent spider state get corrupted?

Any insights would be greatly appreciated.

user3337861
  • This seems like a great approach. Your strategy seems very manageable. I'm not sure how to answer 2. As far as 3 goes it should be safe to run them concurrently unless there is some 3rd party resource that will cause a race condition. – rocktheartsm4l Oct 14 '14 at 15:57
  • I wrote a tutorial for this, we are using it in our projects. [Using multiple spiders in a Scrapy project](http://lnxpgn.github.io/2015/07/27/using-multiple-spiders-in-a-scrapy-project/) – user3337861 Aug 01 '15 at 14:36
  • did you find a solution? – eLRuLL Nov 05 '15 at 16:56

3 Answers


I don't know if it will answer your first question, but I use Scrapy with multiple spiders, and in the past I used the command

scrapy crawl spider1 

but once I had more than one spider this command would activate it or other modules as well, so I started to use this command instead:

scrapy runspider <your full spider1 path with the spiderclass.py> 

example: "scrapy runspider home/Documents/scrapyproject/scrapyproject/spiders/spider1.py"

I hope it will help :)

Shalom Balulu

As all spiders should have their own class, you can set the settings per spider with the custom_settings class attribute, so something like:

from scrapy import Spider

class MySpider1(Spider):
    name = "spider1"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider1/version1'}

class MySpider2(Spider):
    name = "spider2"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider2/version2'}

These custom_settings will overwrite the ones in the settings.py file, so you can still set some global ones there.
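
A rough sketch of how this could be combined with the base spider from the question, so that each subclass carries its own overrides (names and values are taken from the question; treat this as an illustration rather than a drop-in file):

from test.mybasespider import MyBaseSpider

class Spider1(MyBaseSpider):
    name = "spider1"
    start_urls = ['http://test1.com/']
    # Per-spider settings that would otherwise live in spider1_settings.py
    custom_settings = {'JOBDIR': 'spider1-job'}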

eLRuLL

Good job! I didn't find a better way to manage multiple spiders in the documentation.

I don't know about scrapyd. But when running from the command line, you should set the environment variable SCRAPY_PROJECT to the target project.

see scrapy/utils/project.py

ENVVAR = 'SCRAPY_SETTINGS_MODULE'

...

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)
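
In other words, when SCRAPY_SETTINGS_MODULE isn't set, Scrapy falls back to the SCRAPY_PROJECT name (defaulting to 'default'), and init_env() resolves that name through the [settings] section of the nearest scrapy.cfg to pick the settings module. That is why the export SCRAPY_PROJECT=spider1 step in the question selects spider1's settings for scrapy crawl.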
misssprite