
I have a Scrapy spider wrapped into Django, and to start it programmatically I am using scrapyscript, which uses Billiard queues and processes to run Scrapy. I know it's a strange setup, but I need a spider that can be run by cron and work with the Django ORM. Everything works: I get my data scraped and I can run it programmatically. But one thing is wrong: the spider gets stuck and code execution waits forever.

# start.py
import subprocess

cmd = ['C:\\PycharmProjects\\...\\start.bat', ]
subprocess.call(cmd)

# start.bat
python C:\\...\\manage.py runserver --noreload

# urls.py
from my_app.myFile import MyClass
c = MyClass()

# myFile.py
from scrapyscript import Job, Processor
from my_app.spiders import ScrapOnePage  # wherever the spider class actually lives


class MyClass(object):

    def __init__(self):
        githubJob = Job(ScrapOnePage, url='some/url')
        processor = Processor(settings=None)
        data = processor.run(githubJob)
        print(data)
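
One alternative I have been sketching (untested; the command name, file location and spider import path are placeholders) is to skip runserver entirely and expose the job as a Django management command, so cron can call it directly and the interpreter exits on its own when the crawl finishes (though this would not by itself fix the hang inside Processor):

# my_app/management/commands/scrape.py  (sketch only)
from django.core.management.base import BaseCommand
from scrapyscript import Job, Processor

from my_app.spiders import ScrapOnePage  # adjust to the real module


class Command(BaseCommand):
    help = 'Run the ScrapOnePage spider once and print the scraped data.'

    def handle(self, *args, **options):
        # Same Job/Processor usage as in myFile.py, just invoked by
        # "python manage.py scrape" instead of at server import time.
        job = Job(ScrapOnePage, url='some/url')
        data = Processor(settings=None).run(job)
        self.stdout.write(str(data))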

ScrapOnePage works fine, no need to show it; it does the job. The problem is the Processor: after logging "spider closed" it doesn't let go and doesn't continue to the next line. print(data) never happens, no matter how long I wait. This is the output that hangs around forever:

2020-01-05 22:27:29 [scrapy.core.engine] INFO: Closing spider (finished)
2020-01-05 22:27:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1805,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 5933,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 1, 5, 21, 27, 29, 253634),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2020, 1, 5, 21, 27, 28, 275290)}
2020-01-05 22:27:29 [scrapy.core.engine] INFO: Spider closed (finished)

As said, the job gets done, and I can put multiple spiders into the Processor and they all work fine. But the process never stops. I could kill the Django server by its PID, but I really dislike that solution (also because I am not sure whether that would kill the spider running inside Django as well).
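
One workaround I have been toying with (a minimal, untested sketch; the helper name run_scrape_with_timeout, the spider import path and the 60-second limit are placeholders) is to push the blocking processor.run() into a disposable child process and give the caller a timeout, so only the scraper child gets terminated instead of the whole Django server:

# timeout_run.py  (sketch only, same Job/Processor usage as above)
from multiprocessing import Process, Queue
from queue import Empty

from scrapyscript import Job, Processor

from my_app.spiders import ScrapOnePage  # adjust to the real module


def _scrape(result_queue):
    job = Job(ScrapOnePage, url='some/url')
    result_queue.put(Processor(settings=None).run(job))


def run_scrape_with_timeout(timeout=60):
    result_queue = Queue()
    worker = Process(target=_scrape, args=(result_queue,))
    worker.start()
    try:
        # Drain the queue *before* joining; a child that has put data on a
        # Queue will not exit until its feeder thread has flushed it.
        data = result_queue.get(timeout=timeout)
    except Empty:
        data = None
    worker.join(timeout=5)
    if worker.is_alive():
        worker.terminate()  # kills only the scraper child, not Django
    return data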

Any improvement, simpler approach, or hint on how to make the spider let go would be appreciated. Thanks in advance.

Stefan
  • Have you tried to debug it? – ferran87 Jan 12 '20 at 10:43
  • Yes, I did. For some reason the Queue gets stuck when delivering results in the scrapyscript library. Initially I commented that part out, but that caused other issues, and in the end I went for a completely different solution. – Stefan Jan 13 '20 at 10:24
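
For anyone landing here with the same symptom: the behaviour described in the last comment matches a documented pitfall of multiprocessing/Billiard queues. A child process that has put items on a Queue will not exit until the queue's feeder thread has flushed them, so calling join() before draining the queue can block forever. A minimal, self-contained illustration (plain multiprocessing, not scrapyscript itself):

# queue_join_pitfall.py  (illustration only)
from multiprocessing import Process, Queue


def produce(q):
    # Put a fairly large payload on the queue, as a spider returning
    # many scraped items would.
    q.put(['scraped item'] * 100000)


if __name__ == '__main__':
    q = Queue()
    p = Process(target=produce, args=(q,))
    p.start()
    # Calling p.join() here, before the queue is drained, can hang: the
    # child waits for its feeder thread to flush the buffered items.
    data = q.get()   # drain first ...
    p.join()         # ... then join() returns promptly
    print(len(data))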

0 Answers