
I integrated Scrapy into my Django project following this guide.
Unfortunately, no matter what I try, the spider jobs never start, even though schedule.json returns a job ID.

My views:

from uuid import uuid4
from urllib.parse import urlparse

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from rest_framework.decorators import api_view
from scrapyd_api import ScrapydAPI

# Client setup (assumed; the original snippet doesn't show it)
scrapyd = ScrapydAPI('http://localhost:6800')


@csrf_exempt
@api_view(['POST'])
def crawl_url(request):
    url = request.POST.get('url', None)  # take the url from the request
    if not url:
        return JsonResponse({'error': 'Missing args'})
    if not is_valid_url(url):  # is_valid_url is a helper defined elsewhere
        return JsonResponse({'error': 'URL is invalid'})

    domain = urlparse(url).netloc  # parse the url and extract the domain
    unique_id = str(uuid4())  # creates a unique ID.

    # Custom settings for scrapy spider.
    # We can send anything we want to use it inside spiders and pipelines.
    settings = {
        'unique_id': unique_id,  # unique ID for each record for DB
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    }

    # Schedule a new crawling task from scrapyd.
    # settings is a special argument name.
    # This returns an ID which belongs to this task, used to check the task status
    task = scrapyd.schedule('default', 'kw_spider', settings=settings, url=url, domain=domain)

    return JsonResponse({'task_id': task, 'unique_id': unique_id, 'status': 'started'})
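
For illustration, assuming a urlconf entry at the hypothetical path /api/crawl/ (not shown above), this view can be exercised with Django's test client:

from django.test import Client

client = Client()
# '/api/crawl/' is a placeholder; use whatever path the urlconf maps to crawl_url
response = client.post('/api/crawl/', {'url': 'https://example.com'})
print(response.json())  # e.g. {'task_id': '...', 'unique_id': '...', 'status': 'started'}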


@csrf_exempt
@api_view(['GET'])
def get_crawl_data(request):
    task_id = request.GET.get('task_id', None)
    unique_id = request.GET.get('unique_id', None)

    if not task_id or not unique_id:
        return JsonResponse({'error': 'Missing args'})

    # Check status of crawling
    # If finished, makes query from database and get results
    # If not, return active status
    # Possible results are -> pending, running, finished
    status = scrapyd.job_status('default', task_id)
    if not status:
        return JsonResponse({
            'status': 'error',
            'data': 'Task not found'
        })
    elif status == 'finished':
        try:
            item = ScrapyItem.objects.get(unique_id=unique_id)
            return JsonResponse({
                'status': status,
                'data': item.to_dict['data']
            })
        except Exception as e:
            return JsonResponse({
                'status': 'error',
                'data': str(e)
            })
    else:
        return JsonResponse({
            'status': status,
            'data': {}
        })
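
The ScrapyItem model isn't shown above; for context, a minimal sketch consistent with item.to_dict['data'] (note that to_dict is accessed as a property, not called) would look something like this:

import json

from django.db import models


class ScrapyItem(models.Model):
    unique_id = models.CharField(max_length=100, null=True)
    data = models.TextField()  # JSON-encoded list of scraped results

    @property
    def to_dict(self):
        # Exposes the stored JSON under a 'data' key, as the view expects
        return {'data': json.loads(self.data)}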

My spider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KwSpiderSpider(CrawlSpider):
    name = 'kw_spider'

    def __init__(self, *args, **kwargs):
        # __init__ is overridden to make the spider dynamic:
        # url and domain are passed in from the Django view
        self.url = kwargs.get('url')
        self.domain = kwargs.get('domain')
        self.start_urls = [self.url]
        self.allowed_domains = [self.domain]

        # Rules must be assigned before calling super().__init__(),
        # which compiles them
        KwSpiderSpider.rules = [
            Rule(LinkExtractor(unique=True), callback='parse_item'),
        ]
        super(KwSpiderSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        resp_dict = {
            'url': response.url
        }
        # resp_dict['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # resp_dict['name'] = response.xpath('//div[@id="name"]').extract()
        # resp_dict['description'] = response.xpath('//div[@id="description"]').extract()
        return resp_dict
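
The unique_id sent through the scrapyd settings would be picked up on the Scrapy side by an item pipeline that writes back to the Django database. That pipeline isn't shown above, but a minimal sketch along those lines (myapp is a placeholder for the actual Django app) would be:

import json

from myapp.models import ScrapyItem  # placeholder app name; requires Django to be set up in the Scrapy process


class ScrapyAppPipeline(object):
    def __init__(self, unique_id):
        self.unique_id = unique_id
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        # unique_id arrives via the settings dict sent from the Django view
        return cls(unique_id=crawler.settings.get('unique_id'))

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Persist everything under the unique_id once the crawl ends
        ScrapyItem.objects.create(unique_id=self.unique_id, data=json.dumps(self.items))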

I also tried scheduling directly with a curl call:
curl http://localhost:6800/schedule.json -d project=default -d spider=kw_spider
which gave me the following response:
{"node_name": "9jvtf82", "status": "ok", "jobid": "0ca057026e5611e8898f64006a668b22"}

But nothing happens; the job never actually starts.
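
To see where the job sits, the scrapyd queues can also be inspected with the same python-scrapyd-api client used in the views (the job ID below is the one returned by curl):

print(scrapyd.list_jobs('default'))
# -> {'pending': [...], 'running': [...], 'finished': [...]}
print(scrapyd.job_status('default', '0ca057026e5611e8898f64006a668b22'))
# -> 'pending', 'running', 'finished', or '' if the job is unknown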


1 Answer


I solved it by noticing an error in the scrapyd console log:
I was missing the pywin32 library, though I don't understand why it wasn't listed in the requirements.

A simple
pip install pywin32
fixed it.
