
I have a Scrapy spider that worked as expected for a while, but it now returns an empty response.

import scrapy


class BossSpider(scrapy.Spider):
    name = 'bossaz'
    allowed_domains = ['boss.az']
    start_urls = ['https://boss.az/vacancies']

    def parse(self, response):
        for href in response.xpath('//a[@class="results-i-link"]/@href'):
            yield response.follow(href, self.parse_jobs)

        next_page = response.xpath('//span[@class="next"]/a[@rel="next"]/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_jobs(self, response):
        scraped_data = dict()
        scraped_data['job_title'] = response.xpath('//h1[@class="post-title"]/text()').extract_first()
        scraped_data['employer'] = response.xpath('//a[@class="post-company"]/text()').extract_first()
        scraped_data['published'] = response.xpath('//div[@class="bumped_on params-i-val"]/text()').extract_first()
        scraped_data['details'] = response.xpath('//div[@class="post-cols post-info"]').extract()
        yield scraped_data

Right now, the code above produces the stats below when I run the spider on my machine:

{'downloader/request_bytes': 431,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 304,
 'downloader/response_count': 2,
 'downloader/response_status_count/204': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 30, 5, 30, 18, 860994),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'memusage/max': 53403648,
 'memusage/startup': 53403648,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 8, 30, 5, 30, 17, 554091)}

I also tried fetching the page in the terminal with scrapy shell https://boss.az/vacancies. There, response.body is also an empty string. Note that I checked the website's HTML, and its structure has not changed. What could cause this spider to receive HTTP status 204?

Elgin Cahangirov
  • It is working for me. I'm using USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36" in settings. – Hassan Raza Aug 30 '18 at 10:06
  • It is working now after I added that line to the settings file, but I still don't understand the reason. I have other spiders in the same project, and they work properly without changing USER_AGENT. Anyway, thanks for the help! – Elgin Cahangirov Aug 30 '18 at 10:28
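
For reference, the fix from the comments can be sketched as below. Without this setting, Scrapy sends a default user agent that identifies itself as Scrapy. One plausible explanation (an assumption, not confirmed here) is that this site started answering that user agent with an empty 204 response, while the other sites being crawled do not check it. The name BOSSAZ_CUSTOM_SETTINGS is hypothetical and only illustrates the per-spider alternative.

```python
# settings.py: project-wide browser-like user agent, copied from the comment above
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/68.0.3440.106 Safari/537.36"
)

# Alternative: change the user agent for one spider only by assigning a dict
# like this to that spider's `custom_settings` class attribute.
# BOSSAZ_CUSTOM_SETTINGS is a hypothetical name used for illustration.
BOSSAZ_CUSTOM_SETTINGS = {"USER_AGENT": USER_AGENT}
```

To try it without editing any file, the same setting can be passed on the command line, for example scrapy shell -s USER_AGENT="..." https://boss.az/vacancies, and then response.status can be checked in the shell.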
