
I have a Scrapy spider running in a (non-free) Scrapinghub account that sometimes has to OCR a PDF (via Tesseract), which, depending on the number of units, can take quite some time.

What I see in the log is something like this:

    2019-07-07 22:51:50 WARNING [tools.textraction] PDF contains only images - running OCR.
    2019-07-08 00:00:03 INFO    [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force 

The SIGTERM always arrives about one hour after the message saying the OCR has started, so I'm assuming there's a mechanism that kills everything if there's no new request or item for one hour.

How can I hook into that and prevent the shutdown? Is this what `signals.spider_idle` is for?
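For reference, a minimal sketch of hooking `signals.spider_idle`, assuming the one-hour kill is driven by Scrapy's own idle handling rather than a signal sent externally by the platform (raising `DontCloseSpider` keeps an idle spider open, but it cannot block a SIGTERM coming from outside). The spider name and the `ocr_in_progress` helper are hypothetical:

```python
# Minimal sketch: keep an idle spider alive while OCR is still running.
# Assumes the shutdown comes from Scrapy's idle handling; it does NOT
# block a SIGTERM sent by the platform. The spider name and the
# ocr_in_progress() helper are hypothetical.
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class PdfOcrSpider(scrapy.Spider):
    name = "pdf_ocr"  # hypothetical

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run our handler every time the spider goes idle.
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def handle_idle(self):
        # Called when the scheduler has no pending requests or items.
        # Raising DontCloseSpider aborts the normal idle -> close path.
        if self.ocr_in_progress():
            raise DontCloseSpider

    def ocr_in_progress(self):
        # Placeholder: report whether a long-running OCR job is still active.
        return getattr(self, "_ocr_running", False)
```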

kenshin
  • Are you using the scrapinghub platform for this job? If so they have limitations for free accounts. – Rafael Almeida Jul 08 '19 at 12:43
  • @RafaelAlmeida yes I am, but it's not a free account. – kenshin Jul 08 '19 at 12:44
  • I would still contact Scrapinghub, because the `SIGTERM` is being triggered by them. As a side note, if it takes too long to process that item, you should just save the PDF file and run the OCR from another application (a sketch of this approach follows the comments). – Rafael Almeida Jul 08 '19 at 12:58
  • Since you are performing OCR, is it possible that the issue is caused by high memory usage? – Gallaecio Jul 09 '19 at 09:39
  • @Gallaecio Could be. However, I had other cases where the result was `cancelled (out of memory)`; plus, it's always after one hour... so I'd say no. – kenshin Jul 09 '19 at 13:14
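Regarding the offloading approach suggested by Rafael Almeida above, a hedged sketch: have the spider only download and emit the PDFs, leaving Tesseract to a separate process or job so the crawl never blocks for an hour. The item class, spider name, URL and selector are illustrative only:

```python
# Illustrative only: persist the raw PDF from the crawl and OCR it elsewhere.
import scrapy


class PdfItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()   # raw PDF bytes, to be OCRed by a separate job


class PdfDownloadSpider(scrapy.Spider):
    name = "pdf_download"                          # hypothetical
    start_urls = ["https://example.com/reports"]   # placeholder

    def parse(self, response):
        # Follow links to PDFs; the selector depends on the target site.
        for href in response.css("a[href$='.pdf']::attr(href)").getall():
            yield response.follow(href, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # Yield the PDF instead of blocking the reactor with OCR.
        yield PdfItem(url=response.url, body=response.body)
```

In practice Scrapy's built-in FilesPipeline (with the `FILES_STORE` setting) is the more idiomatic way to persist the downloaded files; the OCR step can then read them from that store.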

0 Answers