generic spider for scrapy project

Question

i am creating generic spider (scrapy spider) for multiple websites. below is my project directory structure.

myproject <Directory>
--- __init__.py
--- common.py
--- scrapy.cfg
--- myproject <Directory>
    ---__init__.py
    ---items.py
    ---pipelines.py
    ---settings.py
    ---spiders <Directory>
       ---__init__.py
       ---spider.py              (generic spider)
       ---stackoverflow_com.py   (spider per website)
       ---anotherwebsite1_com.py (spider per website)
       ---anotherwebsite2_com.py (spider per website)

common.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
#
'''
    common file
'''

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.crawler import CrawlerProcess
from scrapy.stats import stats
from scrapy.http import Request
from WSS.items import WssItem
import MySQLdb, time
import urllib2, sys

#Database connection
def open_database_connection():
    connection = MySQLdb.connect(user=db_user, passwd=db_password, db=database, host=db_host, port=db_port, charset="utf8", use_unicode=True)
    cursor = connection.cursor()
    return connection, cursor

def close_database_connection(cursor, connection):
    cursor.close()
    connection.close()
    return

class Domain_:
    def __init__(self, spider_name, allowed_domains, start_urls, extract_topics_xpath, extract_viewed_xpath):
        self.spider_name = spider_name
        self.extract_topics_xpath = extract_topics_xpath
        self.extract_viewed_xpath = extract_viewed_xpath
        self.allowed_domains = allowed_domains
        self.start_urls = start_urls

spider.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
#

from common import *

class DomainSpider(BaseSpider):
    name = "generic_spider"
    def __init__(self, current_domain):
        self.allowed_domains = current_domain.allowed_domains
        self.start_urls = current_domain.start_urls
        self.current_domain = current_domain

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for topics in list(set(hxs.select(self.current_domain.extract_topics_xpath).extract())):
            yield Request(topics, dont_filter=True, callback=self.extract_all_topics_data)

    def extract_all_topics_data(self, response):
        hxs = HtmlXPathSelector(response)
        item = WssItem()
        print "Processing "+response.url
        connection, cursor = open_database_connection()
        for viewed in hxs.select(self.current_domain.extract_viewed_xpath).extract():
                    item['TopicURL']  = response.url
                    item['Topic_viewed'] = viewed
                    yield item
        close_database_connection(cursor, connection)
        return

stackoverflow_com.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
#

from common import *

current_domain = Domain_(
    spider_name = 'stackoverflow_com', 
    allowed_domains = ["stackoverflow.com"], 
    start_urls = ["http://stackoverflow.com/"], 
    extract_topics_xpath = '//div[contains(@class,\"bottomOrder\")]/a/@href',
    extract_viewed_xpath = '//div[contains(@class,\"views\")]/text()'
    )

import WSS.spiders.spider as spider
StackOverflowSpider = spider.DomainSpider(current_domain)

from the above scripts, i don't want to touch spider.py (assuming that all websites having same structure so i can use spider.py for all spiders)

i just want to create new spiders per website same as stackoverflow_com.py and i want to call spider.py for crawling process.

can you please advice me is there anything wrong on my code?. it showing below error message

output1: if i run "scrapy crawl stackoverflow_com" it shows below error message

    C:\myproject>scrapy crawl stackoverflow_com
2013-08-05 09:41:45+0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: WSS)
2013-08-05 09:41:45+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-08-05 09:41:45+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-08-05 09:41:45+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-08-05 09:41:45+0400 [scrapy] DEBUG: Enabled item pipelines: WssPipeline
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 156, in <module>
    execute()
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\commands\crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: stackoverflow_com'

output2: if i run "scrapy crawl generic_spider" it shows below error message

C:\myproject>scrapy crawl generic_spider
2013-08-05 12:25:15+0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: WSS)
2013-08-05 12:25:15+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-08-05 12:25:16+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-08-05 12:25:16+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-08-05 12:25:16+0400 [scrapy] DEBUG: Enabled item pipelines: WssPipeline
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 156, in <module>
    execute()
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\commands\crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "C:\Python27\lib\site-packages\scrapy-0.16.4-py2.7.egg\scrapy\spidermanager.py", line 44, in create
    return spcls(**spider_kwargs)
TypeError: __init__() takes exactly 2 arguments (1 given)

Thanking you in advance :)

score 0 · Answer 1 · answered Aug 05 '13 at 07:23

0

Try following the template from http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

class DomainSpider(BaseSpider):
    name = "generic_spider"
    def __init__(self, current_domain, *args, **kwargs):
        self.allowed_domains = current_domain.allowed_domains
        self.start_urls = current_domain.start_urls
        super(DomainSpider, self).__init__(*args, **kwargs)
        self.current_domain = current_domain
    ...

answered Aug 05 '13 at 07:23

paul trmbrth

20,518
4
53
66

Hello Paul, thank you for your reply.. i have tried your code still i have same issue.. – AGR Aug 05 '13 at 07:43
Ah I see. I think the problem is that `C:\myproject>scrapy crawl generic_spider` does not receive the `current_domain` argument, you need to pass these parameters to `scrapy crawl` command somehow, for example using a pickled `Domain_` instance with `-a pickled_domain=stack_overflow.p` and process this pickled `Domain_` in `__init__(self, pickled_domain, *args, **kwargs): current_domain=pickle.load( open(pickled_domain, "rb" ) )...` – paul trmbrth Aug 05 '13 at 07:52
or simply with the name of the target spider instead of `generic_spider`? what does `scrapy crawl stackoverflow_com` do? (you may have to assign the instance to some name, e.g. `StackOverflowSpider = spider.DomainSpider(current_domain)` in stackoverflow_com.py – paul trmbrth Aug 05 '13 at 07:59
Hi Paul, i have tried your 2nd solution, i have executed "scrapy crawl stackoverflow_com" but still showing same error you can check the logs above. – AGR Aug 05 '13 at 08:31
Ah, I see, probably because `spider.DomainSpider(current_domain)` returns an instance of `generic_spider`while Scrapy expects spiders to be classes, not instances of a spider class. I think you need to use the `type()` function http://docs.python.org/2/library/functions.html#type to create your specific spider classes, something like `type('StackOverflowSpider', (DomainSpider,), current_domain)`. (Warning: not tested :) – paul trmbrth Aug 05 '13 at 08:59

score 0 · Answer 2 · answered Apr 18 '17 at 09:31

0

According to my experience scrapy throws spider not found bug only when spider name is not defined. Make sure your spider name i.e. stackoverflow_com is getting assign in DomainSpider try adding

self.name=current_domain.spider_name

answered Apr 18 '17 at 09:31

Karan Verma

158
2
6

generic spider for scrapy project

2 Answers2