
I have created a basic spider to scrape a small group of job listings from totaljobs.com. I have set up the spider with a single start URL to bring up the list of jobs I am interested in. From there, I launch a separate request for each page of the results. Within each of those requests, I launch a further request for each individual job URL, with a callback to a different parse method.
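Roughly, the structure is like this (a simplified sketch of what I described, not my actual code; the selectors and names are placeholders):

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ['https://www.totaljobs.com/jobs/permanent/welder/in-uk']

    def parse(self, response):
        # Results page: follow each individual job link with a second callback
        for href in response.css('div.job-title a::attr(href)').extract():
            yield response.follow(href, callback=self.parse_job)
        # Follow the pagination link to the next results page
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_job(self, response):
        # Individual job page: this is the request that fails
        yield {'title': response.css('h1::text').extract_first()}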

What I'm finding is that the start URL and all of the results page requests are handled fine - scrapy connects to the site and returns the page content. However, when it attempts to follow the URLs for each individual job page, scrapy isn't able to form a connection. Within my log file, it states:

[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

I'm afraid that I don't have a huge amount of programming experience or knowledge of internet protocols etc., so please forgive me for not being able to provide more information on what might be going on here. I have tried changing the TLS connection type; updating to the latest versions of scrapy, twisted and OpenSSL; rolling back to previous versions of scrapy, twisted and OpenSSL; rolling back the cryptography version; creating a custom Context Factory; and trying various browser agents and proxies. I get the same outcome every time: whenever the URL relates to a specific job page, scrapy cannot connect and I get the above log output.
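For reference, the TLS tweaks were along these lines in settings.py (a rough sketch only, not an exact record of everything I tried; the context factory path is a placeholder, not a real module of mine):

# settings.py - rough sketch of the TLS-related changes tried
# Force a specific TLS version instead of letting Scrapy negotiate
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'

# Point Scrapy at a custom context factory (placeholder path)
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.contextfactory.CustomContextFactory'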

It may be that I am overlooking something very obvious to seasoned scrapers that is preventing me from connecting with scrapy. I have tried following some of the advice in these threads:

https://github.com/scrapy/scrapy/issues/1429

https://github.com/requests/requests/issues/4458

https://github.com/scrapy/scrapy/issues/2717

However, some of it is a bit over my head, e.g. how to update cipher lists. I presume that it is some kind of certificate issue, but then again scrapy is able to connect to other URLs on the same domain, so I don't know.

The code that I've been using to test this is very basic, but here it is anyway:

import scrapy

class Test(scrapy.Spider):


    start_urls = [
                    'https://www.totaljobs.com/job/welder/jark-wakefield-job79229824'
                    ,'https://www.totaljobs.com/job/welder/elliott-wragg-ltd-job78969310'
                    ,'https://www.totaljobs.com/job/welder/exo-technical-job79019672'
                    ,'https://www.totaljobs.com/job/welder/exo-technical-job79074694'
                        ]

    name = "test"

    def parse(self, response):
        print 'aaaa'
        yield {'a': 1}

Scrapy does not connect successfully to the URLs in the above code.

Scrapy connects successfully to the URLs in the code below.

import scrapy

class Test(scrapy.Spider):


    start_urls = [
                    'https://www.totaljobs.com/jobs/permanent/welder/in-uk'
                    ,'https://www.totaljobs.com/jobs/permanent/mig-welder/in-uk'
                    ,'https://www.totaljobs.com/jobs/permanent/tig-welder/in-uk'
                        ]

    name = "test"

    def parse(self, response):
        print 'aaaa'
        yield {'a': 1}

It'd be great if someone could replicate this behaviour (or not, as the case may be) and let me know. Please let me know if I should submit additional details. I apologise if I have overlooked something really obvious. I am using:

Windows 7 64 bit

Python 2.7

scrapy version 1.5.0

twisted version 17.9.0

pyOpenSSL version 17.5.0

lxml version 4.1.1

Mr. Pickles
  • Perhaps the site is doing some sort of robot detection and dropping the connection. – Paulo Scardine Jan 18 '18 at 20:13
  • I suspect that this is the case. I'd be interested to know if there is any way of pinpointing exactly what kind of measures are in place to prevent robot access. If I knew what checks I could do to detect these kinds of measures, then I could avoid writing scrapes for these kinds of sites altogether, rather than devising a way to get around the anti-scraping measures. – Mr. Pickles Jan 19 '18 at 12:26
  • There are several measures to detect robots, including subtle things like the order in which browsers send their headers. It is hard to tell unless you compare the robot's traffic with a normal browser's using something like Wireshark. – Paulo Scardine Jan 22 '18 at 17:21

3 Answers


You can probably try setting a user agent and seeing if that changes things.

You might also try making requests with bigger delays between them, or sending them through a proxy.
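In settings.py that could look roughly like this (a sketch; the values are just examples):

# settings.py - a rough sketch, values are examples only
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Slow the crawl down and add some jitter between requests
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True

# A proxy can also be set per request in the spider, e.g.
# request.meta['proxy'] = 'http://some-proxy-host:8080'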

As it is a jobs website, I imagine they have some sort of anti-scraping mechanism.

This is not an amazing answer, but it is some insight I can share with you to maybe help you figure out your next steps.

JAntunes
  • Changing the user agent works for me. For more detail, check this question: https://stackoverflow.com/questions/47402035/scrapy-twisted-connectionlost-error/57773673#57773673 – Aminah Nuraini Sep 03 '19 at 14:23

This is a link to a blog post I recently read about responsible web scraping with Scrapy. Hopefully it's helpful.

Eb J

In my case, it was caused by the user agent being rejected. You should change the user agent for each request. For that, use scrapy-fake-useragent, and then use this middleware to make sure the user agent is also changed on each retry.

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

from fake_useragent import UserAgent


class Retry500Middleware(RetryMiddleware):

    def __init__(self, settings):
        super(Retry500Middleware, self).__init__(settings)

        fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
        self.ua = UserAgent(fallback=fallback)
        self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')

    def get_ua(self):
        '''Gets a random UA based on the type setting (random, firefox...)'''
        return getattr(self.ua, self.ua_type)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Swap in a fresh user agent before retrying
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        # Also rotate the user agent when the request fails with an exception
        # (e.g. ConnectionLost) rather than an HTTP error status
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, exception, spider)
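To enable it, something along these lines in settings.py should work (a sketch; 'myproject.middlewares' is an assumed path for wherever you put the class):

# settings.py - a sketch; 'myproject.middlewares' is an assumed module path
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in retry middleware and use the custom one instead
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.Retry500Middleware': 550,
}
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
RETRY_TIMES = 5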
Aminah Nuraini