I have created a basic spider to scrape a small group of job listings from totaljobs.com. The spider has a single start URL, which brings up the list of jobs I am interested in. From there, I launch a separate request for each page of the results, and within each of those I yield a further request for every individual job URL, with a different parse method as the callback.
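To make that structure clearer, here is a simplified sketch of what the spider does (the CSS selectors are placeholders standing in for my real ones, not working selectors for totaljobs.com):

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ['https://www.totaljobs.com/jobs/permanent/welder/in-uk']

    def parse(self, response):
        # Follow each results page - these requests all succeed
        for href in response.css('a.pagination::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_results_page)

    def parse_results_page(self, response):
        # Follow each individual job listing - these are the requests that fail
        for href in response.css('a.job-title::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_job)

    def parse_job(self, response):
        # Never reached: the connection is lost before a response arrives
        yield {'url': response.url}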
What I'm finding is that the start URL and all of the results page requests are handled fine: Scrapy connects to the site and returns the page content. However, when it attempts to follow the URLs for the individual job pages, Scrapy isn't able to form a connection. My log file states:
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
I'm afraid I don't have a huge amount of programming experience or knowledge of internet protocols, so please forgive me for not being able to provide more information on what might be going on here. I have tried changing the TLS connection type; updating to the latest versions of Scrapy, Twisted and pyOpenSSL; rolling back to previous versions of Scrapy, Twisted and pyOpenSSL; rolling back the cryptography version; creating a custom context factory; and trying various browser user agents and proxies. I get the same outcome every time: whenever the URL relates to a specific job page, Scrapy cannot connect and I get the above log output.
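For reference, the kinds of settings tweaks I tried look like the following (the exact values are examples of what I attempted, not a known fix):

# settings.py - examples of the tweaks I tried (none of them helped)

# Force a specific TLS method instead of the default negotiation
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'  # also tried 'TLSv1.0' and 'TLS'

# Present a regular desktop browser user agent
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/63.0.3239.132 Safari/537.36')

Proxies were set per request, e.g. yield scrapy.Request(url, meta={'proxy': 'http://some-proxy:8080'}).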
It is quite possible that I am overlooking something that would be obvious to seasoned scrapers and that is preventing me from connecting. I have tried following some of the advice in these threads:
https://github.com/scrapy/scrapy/issues/1429
https://github.com/requests/requests/issues/4458
https://github.com/scrapy/scrapy/issues/2717
However, some of it is a bit over my head, e.g. how to update cipher lists. I presume it is some kind of certificate or TLS handshake issue, but then again Scrapy is able to connect to other URLs on the same domain, so I don't know.
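In case it helps, this is roughly the custom context factory I attempted for changing the cipher list, pieced together from those threads (the class and attribute names are based on my reading of Scrapy 1.5's internals, so treat it as my best guess rather than a known-good approach):

# contextfactory.py - my attempt at a custom cipher list
from twisted.internet.ssl import AcceptableCiphers, CertificateOptions
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

# A broadened cipher string was one of the suggestions in the linked threads
CUSTOM_CIPHERS = AcceptableCiphers.fromOpenSSLCipherString('DEFAULT:!DH')

class CustomCipherContextFactory(ScrapyClientContextFactory):
    def getCertificateOptions(self):
        return CertificateOptions(
            verify=False,
            method=self._ssl_method,  # _ssl_method is set by the parent class
            fixBrokenPeers=True,
            acceptableCiphers=CUSTOM_CIPHERS)

I enabled it in settings.py with DOWNLOADER_CLIENTCONTEXTFACTORY = 'test.contextfactory.CustomCipherContextFactory' (the module path is just where I happened to put the file). It made no difference to the outcome.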
The code that I've been using to test this is very basic, but here it is anyway:
import scrapy

class Test(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.totaljobs.com/job/welder/jark-wakefield-job79229824',
        'https://www.totaljobs.com/job/welder/elliott-wragg-ltd-job78969310',
        'https://www.totaljobs.com/job/welder/exo-technical-job79019672',
        'https://www.totaljobs.com/job/welder/exo-technical-job79074694',
    ]

    def parse(self, response):
        print 'aaaa'  # never printed - the connection fails first
        yield {'a': 1}
Scrapy fails to connect to the URLs in the code above, but connects successfully to the URLs in the code below.
import scrapy

class Test(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.totaljobs.com/jobs/permanent/welder/in-uk',
        'https://www.totaljobs.com/jobs/permanent/mig-welder/in-uk',
        'https://www.totaljobs.com/jobs/permanent/tig-welder/in-uk',
    ]

    def parse(self, response):
        print 'aaaa'  # printed as expected - these pages connect fine
        yield {'a': 1}
It'd be great if someone could try to replicate this behaviour (or fail to, as the case may be) and let me know. Please tell me if I should provide additional details, and apologies if I have overlooked something really obvious. I am using:
Windows 7 64-bit
Python 2.7
Scrapy version 1.5.0
Twisted version 17.9.0
pyOpenSSL version 17.5.0
lxml version 4.1.1