
I'm using Scrapy inside a Flask app, with Celery running the crawl as a background task. I start Celery as normal: `celery -A scrapy_flask.celery worker -l info`

It works well...

However, I want to use WebSocket in Scrapy to send data to the web page, so my code changed in the following three places:

  • `socketio = SocketIO(app)` -> `socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)`, so the spider can emit to clients through Redis (see the sketch after this list)

  • add `import eventlet` and `eventlet.monkey_patch()` at the top of the module

  • start Celery with eventlet enabled: `celery -A scrapy_flask.celery -P eventlet worker -l info`
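
For context, the emitting side would look roughly like the snippet below. This is only a minimal sketch of Flask-SocketIO's external-process pattern; the `spider_update` event name and the payload are made-up examples, not part of my real code.

    # emitting from an external process (e.g. the spider or the Celery task):
    # create a SocketIO instance bound only to the Redis message queue, no Flask app
    from flask_socketio import SocketIO

    external_sio = SocketIO(message_queue='redis://127.0.0.1/0')

    # the web server listening on the same queue forwards this event to the
    # browsers; 'spider_update' is a hypothetical event name for illustration
    external_sio.emit('spider_update', {'url': 'http://www.XXXXXXX.com/', 'status': 'scraped'})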

Then the spider gets an error: `Error downloading <GET http://www.XXXXXXX.com/>: DNS lookup failed: address 'www.XXXXXXX.com' not found: timeout error.`

Here is my demo code:

    # coding=utf-8
    import eventlet
    eventlet.monkey_patch()

    from flask import Flask, render_template
    from flask_socketio import SocketIO
    from celery import Celery

    app = Flask(__name__, template_folder='./')

    # Celery configuration
    app.config['CELERY_BROKER_URL'] = 'redis://127.0.0.1/0'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://127.0.0.1/0'

    celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
    celery.conf.update(app.config)

    SOCKETIO_REDIS_URL = 'redis://127.0.0.1/0'
    socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)

    from scrapy.crawler import CrawlerProcess
    from TestSpider.start_test_spider import settings
    from TestSpider.TestSpider.spiders.UpdateTestSpider import UpdateTestSpider

    @celery.task
    def background_task():
        process = CrawlerProcess(settings)
        process.crawl(UpdateTestSpider)
        process.start() # the script will block here until the crawling is finished

    @app.route('/')
    def index():
        return render_template('index.html')

    @app.route('/task')
    def start_background_task():
        background_task.delay()
        return 'Started'

    if __name__ == '__main__':
        socketio.run(app, host='0.0.0.0', port=9000, debug=True)

Here is the logging:

    [2016-11-25 09:33:39,319: ERROR/MainProcess] Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
    [2016-11-25 09:33:39,320: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] ERROR: Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
    [2016-11-25 09:33:39,420: INFO/MainProcess] Closing spider (finished)
    [2016-11-25 09:33:39,421: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] INFO: Closing spider (finished)
    [2016-11-25 09:33:39,422: INFO/MainProcess] Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
     'downloader/request_bytes': 639,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 11, 25, 1, 33, 39, 421501),
     'log_count/DEBUG': 4,
     'log_count/ERROR': 1,
     'log_count/INFO': 10,
     'log_count/WARNING': 15,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2016, 11, 25, 1, 30, 39, 15207)}
  • That's quite an unusual setup. I have no idea if this is supposed to work. Are you able to sniff network traffic and check if the DNS queries are getting through and being replied to? – paul trmbrth Nov 24 '16 at 11:30
  • @paul trmbrth when I start Celery with `celery -A scrapy_flask.celery worker -l info`, Scrapy runs well. But when I enable eventlet in Celery with `celery -A scrapy_flask.celery -P eventlet worker -l info`, the error occurs... so the network and DNS should be working fine. – jiayi Peng Nov 24 '16 at 11:40
  • _"DNS should work well"_ but does not with eventlet. I have no idea what could explain this. I was suggesting you check if the DNS query actually is sent over the wire or not, or if the response is received and not processed. Hard debugging times ahead. – paul trmbrth Nov 24 '16 at 11:44
  • @paul trmbrth thanks. I tested my code on a VPS, where the Internet and DNS are fine, but the error occurred again. So there must be something wrong in my code; maybe running Scrapy with eventlet is not a good way? – jiayi Peng Nov 25 '16 at 02:05
  • @jiayiPeng You do not need to run Celery with eventlet, this is a requirement of the Flask-SocketIO server only. I recommend that you run the Flask server with eventlet, and the Celery workers in the regular way. See my [flack](https://github.com/miguelgrinberg/flack) application for an example that works in this way. – Miguel Grinberg Nov 25 '16 at 04:52
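
For reference, the arrangement suggested in the last comment would look roughly like the sketch below: eventlet is used only by the Flask-SocketIO web server process, the Celery worker runs on the default prefork pool, and the task reports back to browsers through the Redis message queue. This is an untested sketch, and the `crawl_finished` event name is a hypothetical example, not something taken from the flack project.

    # scrapy_flask.py (sketch): no eventlet.monkey_patch() in the module the
    # Celery worker imports, so Twisted's DNS resolver is left untouched
    from flask import Flask, render_template
    from flask_socketio import SocketIO
    from celery import Celery
    from scrapy.crawler import CrawlerProcess
    from TestSpider.start_test_spider import settings
    from TestSpider.TestSpider.spiders.UpdateTestSpider import UpdateTestSpider

    app = Flask(__name__, template_folder='./')
    app.config['CELERY_BROKER_URL'] = 'redis://127.0.0.1/0'

    celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])

    # the message queue lets the worker process push events to the web server
    socketio = SocketIO(app, message_queue='redis://127.0.0.1/0')

    @celery.task
    def background_task():
        process = CrawlerProcess(settings)
        process.crawl(UpdateTestSpider)
        process.start()  # blocks until the crawl finishes
        # 'crawl_finished' is a hypothetical event name used for illustration
        socketio.emit('crawl_finished', {'status': 'done'})

    @app.route('/task')
    def start_background_task():
        background_task.delay()
        return 'Started'

    if __name__ == '__main__':
        # only this web server process needs eventlet; if monkey patching is
        # required, it would happen here, not in the worker
        socketio.run(app, host='0.0.0.0', port=9000, debug=True)

    # worker (default prefork pool, no -P eventlet):
    #     celery -A scrapy_flask.celery worker -l info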
