
I am using this example here to change my identity with Tor/Privoxy, but I have faced several issues, such as having to type "scrapy crawl something.py" multiple times to start the spider, or having the spider stop abruptly in the middle of a crawl without any sort of error message.
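
As far as I understand, the identity renewal in that example boils down to sending Tor a NEWNYM signal over its control port between batches of requests. Roughly, that step looks like this (a simplified sketch using stem, with a placeholder control port and password; the example's actual middleware is more involved):

# Simplified sketch of renewing the Tor identity, not the exact middleware code.
# Assumes Tor's ControlPort (9051 here) is enabled and the stem package is installed.
from stem import Signal
from stem.control import Controller

def renew_tor_identity(control_port=9051, password='controller_password'):
    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)  # ask Tor to build a new circuit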

something.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from urlparse import urljoin  # urllib.parse on Python 3

from jobscentral.items import JobsItems  # project items module


class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-it',
    ]

    custom_settings = {
                       'TOR_RENEW_IDENTITY_ENABLED': True,
                       'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 20
                       }

    download_delay = 4
    handle_httpstatus_list = [301, 302]

    rules = (
        #Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg", ), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback='self.parse', follow=True),
    )

    def parse(self, response):
        items = []

        self.logger.info("Visited Outer Link %s", response.url)

        for sel in response.xpath('//h4'):
            item = JobsItems()

        next_page = response.xpath('//li[@class="page-item"]/a[@aria-label="Next"]/@href').extract_first()

        if next_page:
            base_url = get_base_url(response)
            absolute_next_page = urljoin(base_url, next_page)
            yield scrapy.Request(absolute_next_page, self.parse, dont_filter=True)

    def parse_jobdetails(self, response):
        self.logger.info('Visited Internal Link %s', response.url)
        print(response)
        item = response.meta['item']
        item = self.getJobInformation(item, response)
        return item

    def getJobInformation(self, item, response):
        trans_table = {ord(c): None for c in u'\r\n\t\u00a0'}

        item['jobnature'] = response.xpath('//job-snapshot/dl/div[1]/dd//text()').extract_first()
        return item
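
Since parse_jobdetails reads the partially built item back out of response.meta, the detail-page requests (not shown in the snippet above) have to be yielded with the item attached, along these lines — a hypothetical sketch with a placeholder link XPath, not the exact code:

# Hypothetical sketch of yielding a detail-page request from inside the //h4 loop;
# the link XPath and variable names are placeholders.
job_link = sel.xpath('.//a/@href').extract_first()
if job_link:
    request = scrapy.Request(urljoin(get_base_url(response), job_link),
                             callback=self.parse_jobdetails)
    request.meta['item'] = item  # parse_jobdetails reads this via response.meta['item']
    yield request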

Error message when it fails to start crawling:

2017-09-12 16:55:09 [scrapy.middleware] INFO: Enabled item pipelines:
['jobscentral.pipelines.JobscentralPipeline']
2017-09-12 16:55:09 [scrapy.core.engine] INFO: Spider opened
2017-09-12 16:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-12 16:55:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-12 16:55:11 [scrapy.extensions.throttle] INFO: slot: jobscentral.com.sg | conc: 1 | delay: 4000 ms (-1000) | latency: 1993 ms | size: 67510 bytes
2017-09-12 16:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://jobscentral.com.sg/jobs-it> (referer: None)
2017-09-12 16:55:11 [IT] INFO: got response 200 for 'https://jobscentral.com.sg/jobs-it'
2017-09-12 16:55:11 [IT] INFO: Visited Outer Link https://jobscentral.com.sg/jobs-it
2017-09-12 16:55:11 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 16:55:11 [IT] DEBUG: Closing connection pool...

EDIT: error log

<<<HUGE CHUNK OF HTML>>> from response.body here
---------------------------------------------------------
2017-09-12 17:39:01 [IT] INFO: Visited Outer Link https://jobscentral.com.sg/jobs-it
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 17:39:01 [IT] DEBUG: Closing connection pool...
2017-09-12 17:39:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 290,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 68352,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'memusage/max': 58212352,
 'memusage/startup': 58212352,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671)}
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Spider closed (finished)
  • Most probably you are getting a 200 response, but the response as such is blank or something else like a captcha. So make sure to print the response. This will show you the response HTML and you can see what is different when the spider doesn't start. – Tarun Lalwani Sep 12 '17 at 09:10
  • Do I just print out response.body to check whether the 200 response got any sort of HTML? – dythe Sep 12 '17 at 09:30
  • Yes, add `print(response.body)` – Tarun Lalwani Sep 12 '17 at 09:32
  • Managed to catch one instance where it wasn't working; there was still a response from response.body with all the HTML from the webpage. Edited the post on top. – dythe Sep 12 '17 at 09:42
  • Add `yield from super(IT, self).parse(response)` inside `def parse(self, response):` at the very top and see if it helps (spelled out in the sketch after these comments) – Tarun Lalwani Sep 12 '17 at 09:50
  • `yield from super(IT, self).parse(response)`, or without the `from`? EDIT: doesn't seem to help; it still stops the spider right away after starting it. – dythe Sep 12 '17 at 09:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/154221/discussion-between-tarun-lalwani-and-dythe). – Tarun Lalwani Sep 12 '17 at 10:06
  • Did it resolve? I am facing the same issue – Vineet Oct 01 '20 at 18:55
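
The change suggested in the comments above would look roughly like this inside the spider — a sketch only; on Python 2, where `yield from` is not available, the equivalent is a loop that re-yields whatever the parent method produces:

def parse(self, response):
    # Delegate to CrawlSpider's built-in parse first so the Rules still get applied,
    # then continue with the custom per-page logic.
    for request_or_item in super(IT, self).parse(response):
        yield request_or_item

    self.logger.info("Visited Outer Link %s", response.url)
    # ... rest of the existing parse body ...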

0 Answers