I am using Scrapy, and it is great! It is so fast to build a crawler. But the number of web sites is increasing, and I need to create new spiders. These web sites are all of the same type, so all of the spiders use the same items, pipelines, and parsing process.
The contents of the project directory:
test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py
To reduce source code redundancy, mybasespider.py contains a base spider MyBaseSpider; 95% of the source code is in it, and all other spiders inherit from it. If a spider needs something special, it overrides some class methods; normally only a few lines of source code are needed to create a new spider.
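Roughly, the relationship looks like this (a minimal sketch; the method names and selectors are only placeholders for the real shared logic):
# mybasespider.py
import scrapy

class MyBaseSpider(scrapy.Spider):
    # common crawling logic shared by every spider
    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        # shared item extraction; a subclass overrides this only when a site differs
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}

# spiders/spider1.py -- a new spider is usually only a few lines
from test.mybasespider import MyBaseSpider

class Spider1(MyBaseSpider):
    name = 'spider1'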
All common settings are placed in settings.py, and one spider's special settings live in [spider name]_settings.py. For example, the special settings of spider1 are in spider1_settings.py:
from settings import *
LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'
START_URLS = [
'http://test1.com/',
]
The special settings of spider2 are in spider2_settings.py:
from settings import *
LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'
START_URLS = [
'http://test2.com/',
]
Scrapy uses LOG_FILE, LOG_LEVEL and JOBDIR before launching a spider. All URLs in START_URLS are filled into MyBaseSpider.start_urls; each spider has different contents, but the name START_URLS used in the base spider MyBaseSpider doesn't change.
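Inside MyBaseSpider this is done roughly like this (a sketch of the idea, not the exact code from the project):
# in mybasespider.py -- reading START_URLS from the active settings module
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(MyBaseSpider, cls).from_crawler(crawler, *args, **kwargs)
    # whichever settings module is active supplies the start URLs
    spider.start_urls = crawler.settings.getlist('START_URLS')
    return spider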
The contents of scrapy.cfg:
[settings]
default = test.settings
spider1 = test.spider1_settings
spider2 = test.spider2_settings
[deploy]
url = http://localhost:6800/
project = test
To run a spider, such as spider1
:
export SCRAPY_PROJECT=spider1
scrapy crawl spider1
But this way can't be used to run spiders in scrapyd. The scrapyd-deploy command always uses the 'default' project name from the 'settings' section of scrapy.cfg to build an egg file and deploy it to scrapyd.
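To make the problem concrete, the deploy step looks something like this (the exact invocation may differ):
export SCRAPY_PROJECT=spider1
scrapyd-deploy
# the egg is still built from test.settings (the 'default' entry),
# so the per-spider settings modules never reach scrapyd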
I have several questions:
1. Is this the right way to use multiple spiders in one project if I don't create a project per spider? Are there any better ways?
2. How can I separate a spider's special settings, as above, in a way that also works under scrapyd and reduces source code redundancy?
3. If all spiders use the same JOBDIR, is it safe to run them all concurrently? Would the persistent spider state be corrupted?
Any insights would be greatly appreciated.