I have a Scrapy spider running in a (non-free) Scrapinghub account that sometimes has to OCR a PDF (via Tesseract), which, depending on the number of units, can take quite some time.
What I see in the log is something like this:
2220: 2019-07-07 22:51:50 WARNING [tools.textraction] PDF contains only images - running OCR.
2221: 2019-07-08 00:00:03 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
The SIGTERM always arrives about one hour after the message saying the OCR has started, so I'm assuming there's a mechanism that kills the job if no new request or item has been produced for one hour.
How can I hook into that and prevent the shutdown? Is this a case for the signals.spider_idle signal?