I'm running Scrapy inside a Flask app, as a Celery background task.
I start the Celery worker as usual:
celery -A scrapy_flask.celery worker -l info
This works fine. However, I now want to use WebSocket inside the Scrapy spider to push data to the web page, so I changed my code in the following three places:
1. Attach the Redis message queue to Flask-SocketIO:

socketio = SocketIO(app)
-> socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)

2. Monkey-patch the standard library with eventlet at the top of the module:

import eventlet
eventlet.monkey_patch()

3. Start Celery with the eventlet pool enabled:

celery -A scrapy_flask.celery -P eventlet worker -l info
After these changes, the spider fails with: Error downloading <GET http://www.XXXXXXX.com/>: DNS lookup failed: address 'www.XXXXXXX.com' not found: timeout error.
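My guess is that eventlet's monkey patching is what breaks Twisted's thread-pool-based DNS resolver inside the worker. Eventlet can patch the standard library selectively, so one thing I considered trying is leaving the thread module unpatched (just an idea, not verified; the keyword arguments are from eventlet's documented monkey_patch API):

import eventlet
# patch everything except the thread module, in case Twisted's
# threaded DNS resolver needs real OS threads
eventlet.monkey_patch(thread=False)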
Here is my demo code:
# coding=utf-8
import eventlet
eventlet.monkey_patch()  # patch the standard library before anything else is imported

from flask import Flask, render_template
from flask_socketio import SocketIO
from celery import Celery

app = Flask(__name__, template_folder='./')

# Celery configuration
app.config['CELERY_BROKER_URL'] = 'redis://127.0.0.1/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://127.0.0.1/0'

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)

SOCKETIO_REDIS_URL = 'redis://127.0.0.1/0'
socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)

# Scrapy imports (after the monkey patch, as in my original module order)
from scrapy.crawler import CrawlerProcess
from TestSpider.start_test_spider import settings
from TestSpider.TestSpider.spiders.UpdateTestSpider import UpdateTestSpider


@celery.task
def background_task():
    process = CrawlerProcess(settings)
    process.crawl(UpdateTestSpider)
    process.start()  # blocks here until the crawl is finished


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/task')
def start_background_task():
    background_task.delay()
    return 'Started'


if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=9000, debug=True)
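For completeness, the plan is for the spider to push updates back through the same Redis message queue, using Flask-SocketIO's documented external-process emitter; roughly like this (the event name 'spider_update' and the payload are placeholders of mine):

from flask_socketio import SocketIO

# a SocketIO instance created with only a message_queue can emit
# from a process that is not the Flask server
external_sio = SocketIO(message_queue='redis://127.0.0.1/0')
external_sio.emit('spider_update', {'url': 'http://www.XXXXXXX.com/'})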
Here is the worker log:
[2016-11-25 09:33:39,319: ERROR/MainProcess] Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
[2016-11-25 09:33:39,320: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] ERROR: Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
[2016-11-25 09:33:39,420: INFO/MainProcess] Closing spider (finished)
[2016-11-25 09:33:39,421: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] INFO: Closing spider (finished)
[2016-11-25 09:33:39,422: INFO/MainProcess] Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
'downloader/request_bytes': 639,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 11, 25, 1, 33, 39, 421501),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 15,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 11, 25, 1, 30, 39, 15207)}
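One workaround I thought about (untested): launch the crawl in a fresh interpreter from the task, so Scrapy and Twisted never run inside the eventlet-patched worker process. A rough sketch, where run_spider.py is a hypothetical standalone script containing just the three CrawlerProcess lines from the demo code above:

import subprocess
import sys

@celery.task
def background_task():
    # run the crawl in an unpatched child process
    subprocess.check_call([sys.executable, 'run_spider.py'])

That feels like a workaround rather than a fix, though. Why does eventlet's monkey patching break Scrapy's DNS lookups, and is there a way to make them work together?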