
This is very strange: I wrote a Scrapy spider with its pipeline and crawled a huge amount of data, and it always worked well. Today when I re-ran the same code, it suddenly didn't work at all. Here are the details:

My Spider - base_url_spider.py

import re
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BaseURLSpider(CrawlSpider):
    '''
    This class is responsible for crawling globe and mail articles and their comments
    '''
    name = 'BaseURL'
    allowed_domains = ["www.theglobeandmail.com"]

    # seed urls
    url_path = r'../Sample_Resources/Online_Resources/sample_seed_urls.txt'
    start_urls = [line.strip() for line in open(url_path).readlines()]

    # Rules for including and excluding urls
    rules = (
        Rule(LinkExtractor(allow=r'\/opinion\/.*\/article\d+\/$'), callback="parse_articles"),
    )

    def __init__(self, **kwargs):
        '''
        :param kwargs:
        Read user arguments and initialize variables
        '''
        super().__init__(**kwargs)

        self.headers = {'User-Agent': 'Mozilla/5.0',
                        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                        'X-Requested-With': 'XMLHttpRequest'}
        self.ids_seen = set()


    def parse_articles(self, response):
        article_ptn = r"http://www.theglobeandmail.com/opinion/(.*?)/article(\d+)/"
        resp_url = response.url
        article_m = re.match(article_ptn, resp_url)
        article_id = ''
        if article_m is not None:
            article_id = article_m.group(2)
            if article_id not in self.ids_seen:
                self.ids_seen.add(article_id)

                soup = BeautifulSoup(response.text, 'html.parser')
                content = soup.find('div', {"class": "column-2 gridcol"})
                if content is not None:
                    text = content.find_all('p', {"class": ''})
                    if len(text) > 0:
                        print('*****In Spider, Article ID*****', article_id)
                        print('***In Spider, Article URL***', resp_url)

                        yield {article_id: {"article_url": resp_url}}

If I only run my spider code from the command line with `scrapy runspider --logfile ../logs/log.txt ScrapeNews/spiders/article_base_url_spider.py`, it can crawl all the urls in start_urls.
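
For comparison, here is a minimal sketch of starting the same spider through Scrapy's own API with the project settings loaded, so that settings.py (and therefore ITEM_PIPELINES) definitely applies. The import path ScrapeNews.spiders.base_url_spider is an assumption based on the file names above.

# Sketch only: assumes it is run from the project root (where scrapy.cfg lives)
# so that get_project_settings() can find ScrapeNews.settings.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ScrapeNews.spiders.base_url_spider import BaseURLSpider  # assumed module path

process = CrawlerProcess(get_project_settings())
process.crawl(BaseURLSpider)
process.start()  # blocks until the crawl finishes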

My Pipeline - base_url_pipelines.py

import json


class BaseURLPipelines(object):

    def process_item(self, item, spider):
        article_id = list(item.keys())[0]
        print("****Pipeline***", article_id)
        f_name = r'../Sample_Resources/Online_Resources/sample_base_urls.txt'
        with open(f_name, 'a') as out:
            json.dump(item, out)
            out.write("\n")

        return item
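
For reference, the same pipeline could also be written so that the output file is opened once per crawl in open_spider and closed in close_spider, instead of being reopened for every item. A rough sketch, keeping the same output path:

import json


class BaseURLPipelines(object):
    # Sketch only: same behaviour as above, but the file handle is kept open
    # for the whole crawl instead of being reopened for every item.
    def open_spider(self, spider):
        self.out = open(r'../Sample_Resources/Online_Resources/sample_base_urls.txt', 'a')

    def close_spider(self, spider):
        self.out.close()

    def process_item(self, item, spider):
        article_id = list(item.keys())[0]
        print("****Pipeline***", article_id)
        json.dump(item, self.out)
        self.out.write("\n")
        return item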

My settings - settings.py. I have these lines uncommented:

BOT_NAME = 'ScrapeNews'
SPIDER_MODULES = ['ScrapeNews.spiders']
NEWSPIDER_MODULE = 'ScrapeNews.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'ScrapeNews.article_comment_pipelines.ArticleCommentPipeline': 400,
}
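
For reference, each key in ITEM_PIPELINES is the import path of a pipeline class. An entry pointing at the BaseURLPipelines class shown above would presumably look like the following (the module path ScrapeNews.base_url_pipelines is an assumption from the file name):

ITEM_PIPELINES = {
    'ScrapeNews.base_url_pipelines.BaseURLPipelines': 400,  # assumed module path
}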

My scrapy.cfg. This file is supposed to indicate where the settings file is:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = ScrapeNews.settings

[deploy]
#url = http://localhost:6800/
project = ScrapeNews

All these things used to work pretty well together.

However, today when I re-ran the code, I got this type of log output:

2017-04-24 14:14:15 [scrapy] INFO: Enabled item pipelines:
['ScrapeNews.article_comment_pipelines.ArticleCommentPipeline']
2017-04-24 14:14:15 [scrapy] INFO: Spider opened
2017-04-24 14:14:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-24 14:14:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-24 14:14:15 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/robots.txt> (referer: None)
2017-04-24 14:14:20 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/austerity-is-here-all-that-matters-is-the-math/article627532/> (referer: None)
2017-04-24 14:14:24 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/ontario-can-no-longer-hide-from-taxes-restraint/article546776/> (referer: None)
2017-04-24 14:14:24 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.theglobeandmail.com/life/life-video/video-what-was-starbucks-thinking-with-their-new-unicorn-frappuccino/article34787773/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-04-24 14:14:31 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/for-palestinians-the-other-enemy-is-their-own-leadership/article15019936/> (referer: None)
2017-04-24 14:14:32 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/would-quebecs-partitiongo-back-on-the-table/article17528694/> (referer: None)
2017-04-24 14:14:36 [scrapy] INFO: Received SIG_UNBLOCK, shutting down gracefully. Send again to force 
2017-04-24 14:14:36 [scrapy] INFO: Closing spider (shutdown)
2017-04-24 14:14:36 [scrapy] INFO: Received SIG_UNBLOCK twice, forcing unclean shutdown

Compared with the abnormal log output above, when I run only my spider the log looks fine, showing things like this:

2017-04-24 14:21:20 [scrapy] DEBUG: Scraped from <200 http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/>
{'543479': {'article_url': 'http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/'}}
2017-04-24 14:21:20 [scrapy] DEBUG: Scraped from <200 http://www.theglobeandmail.com/opinion/saint-making-the-blessed-politics-of-canonization/article624413/>
{'624413': {'article_url': 'http://www.theglobeandmail.com/opinion/saint-making-the-blessed-politics-of-canonization/article624413/'}}
2017-04-24 14:21:20 [scrapy] INFO: Closing spider (finished)
2017-04-24 14:21:20 [scrapy] INFO: Dumping Scrapy stats:

In the abnormal log output above, I noticed something about robots:

2017-04-24 14:14:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-24 14:14:15 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/robots.txt> (referer: None)

GET http://www.theglobeandmail.com/robots.txt never appeared anywhere in the normal log output. When I typed this URL into the browser, I didn't quite understand what it is. So I'm not sure whether the problem is that the website I am crawling added some bot restrictions?

Or does the problem come from Received SIG_UNBLOCK, shutting down gracefully? I didn't find any solution for this.

The command I used to run the code is `scrapy runspider --logfile ../../Logs/log.txt base_url_spider.py`.

Do you know how to deal with this problem?

Cherry Wu

1 Answer


robots.txt is a file that websites use to let web crawlers know whether the site is allowed to be scraped. You set ROBOTSTXT_OBEY = True, which means Scrapy will obey the robots.txt rules.

Change it to ROBOTSTXT_OBEY = False and it should work.
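
In settings.py that would look something like this (a sketch; whether to ignore robots.txt is your own call):

# settings.py
ROBOTSTXT_OBEY = False  # do not let the crawl be restricted by the site's robots.txt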

CK Chen
  • Thank you very much for pointing this out! I changed that to False, and now the robots.txt request no longer appears, but I still get `Received SIG_UNBLOCK, shutting down gracefully. Send again to force`. – Cherry Wu Apr 25 '17 at 04:54
  • What exact command did you use to run the spider when you got SIG_UNBLOCK? – CK Chen Apr 25 '17 at 06:01
  • I used `scrapy runspider --logfile ../../Logs/log.txt base_url_spider.py` – Cherry Wu Apr 25 '17 at 06:06
  • Receiving SIG_UNBLOCK is unusual. Have you checked the system log to see which process sent this signal? – CK Chen Apr 25 '17 at 06:13
  • Do you mean checking the system log of my machine, or the Scrapy log? If it's Scrapy's log, it didn't say. If it's my machine's log, do you know which log to check? I have never checked my own machine's logs; it's a Mac. – Cherry Wu Apr 25 '17 at 06:39