
I tried using the CLOSESPIDER_TIMEOUT extension setting to stop spiders that run for more than 3 hours.

CLOSESPIDER_TIMEOUT = 3 * 60 * 60

Although the spiders receive the close (timeout) request, they never actually stop and keep on running.

Any ideas on what's wrong in this case?
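
For reference, a minimal sketch of how the setting can be applied per spider (spider name and start URL are placeholders, not the actual project; the same value could also live in `settings.py`):

```python
import scrapy


class LongRunningSpider(scrapy.Spider):
    # placeholder spider; the real ones crawl for many hours
    name = "long_running"
    start_urls = ["https://example.com"]

    custom_settings = {
        # ask Scrapy to close the spider after 3 hours
        "CLOSESPIDER_TIMEOUT": 3 * 60 * 60,
    }

    def parse(self, response):
        # keep following links so the crawl runs long enough to hit the timeout
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```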

codehia

1 Answer


If your spider gets the close timeout request, the extension seems to be working. It doesn't look like anything is wrong, but you might have to wait a bit before the spider fully closes: it will first finish the already scheduled requests before shutting down completely.
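
For context, this is roughly what the built-in close-spider timeout does under the hood (a simplified sketch; class names and details vary by Scrapy version): it only *asks* the engine to close, so in-flight and already scheduled requests are still drained first.

```python
from twisted.internet import reactor
from scrapy import signals
from scrapy.exceptions import NotConfigured


class TimeoutCloseSketch:
    """Simplified stand-in for Scrapy's CloseSpider timeout behaviour."""

    def __init__(self, crawler, timeout):
        self.crawler = crawler
        self.timeout = timeout

    @classmethod
    def from_crawler(cls, crawler):
        timeout = crawler.settings.getfloat("CLOSESPIDER_TIMEOUT")
        if not timeout:
            raise NotConfigured
        ext = cls(crawler, timeout)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        # After the timeout it merely *requests* a graceful close; the engine
        # then stops scheduling new requests but still waits for pending ones.
        reactor.callLater(
            self.timeout,
            self.crawler.engine.close_spider,
            spider,
            "closespider_timeout",
        )
```

So a long tail after the timeout usually points to a large backlog of pending requests or something blocking the close, not to a broken extension.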

Wim Hermans
  • That was my assumption as well. However, the spider runs for more than a day with a timeout of 3 hours, so I'm assuming something is wrong. Any ideas? – codehia Apr 30 '20 at 08:39
  • I can confirm that after `CLOSESPIDER_TIMEOUT` Scrapy will stop scheduling new requests, but it will continue executing the already scheduled ones. You can also check this [solution](https://stackoverflow.com/a/55877309/10884791): calling `os._exit(0)` (see the sketch below these comments) – Georgiy Apr 30 '20 at 10:47
  • Shouldn't take that long indeed. Did you define any close_spider methods in your pipelines/extensions, or an image/file download pipeline that is blocking? You can also try the telnet console (https://docs.scrapy.org/en/latest/topics/telnetconsole.html) when it looks blocked, to see whether the queues are still making progress – Wim Hermans May 01 '20 at 05:37
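
Building on Georgiy's suggestion, here is a hedged sketch of the hard-stop approach: an extra extension that force-exits the process at an absolute deadline even if the graceful close hangs. `HardKillSpider` and `HARDKILL_TIMEOUT` are made-up names for illustration, not built-in Scrapy settings.

```python
import os

from twisted.internet import reactor
from scrapy import signals
from scrapy.exceptions import NotConfigured


class HardKillSpider:
    """Force-exits the whole process at a deadline, even if closing hangs."""

    def __init__(self, deadline):
        self.deadline = deadline

    @classmethod
    def from_crawler(cls, crawler):
        # HARDKILL_TIMEOUT is an illustrative setting name, not a Scrapy one.
        deadline = crawler.settings.getfloat("HARDKILL_TIMEOUT")
        if not deadline:
            raise NotConfigured
        ext = cls(deadline)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        # os._exit() skips all cleanup, which is the point when a blocking
        # pipeline or a stuck download keeps the graceful close from finishing.
        reactor.callLater(self.deadline, os._exit, 0)
```

Enabled with something like the following (illustrative values, paths assume the extension lives in `myproject/extensions.py`):

```python
# settings.py
EXTENSIONS = {"myproject.extensions.HardKillSpider": 500}
CLOSESPIDER_TIMEOUT = 3 * 60 * 60   # graceful close after 3 hours
HARDKILL_TIMEOUT = 4 * 60 * 60      # hard exit one hour later as a safety net
```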