
I am trying to crawl this site

https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs

with Scrapy and I keep getting Twisted request/disconnection errors. I am not using a proxy, and I have tried both setting the user agent and setting all of the headers based on this answer

Here is the code generating the request:

from scrapy import Request

def start_requests(self):
    url = 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs'

    # full browser-style headers
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'DNT': '1',
        'Host': 'www5.apply2jobs.com',
        'Referer': 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=2524&CurrentPage=2',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'
    }

    yield Request(url=url, headers=headers, callback=self.parse)

and this is the relevant log output:

2017-08-28 13:34:13 [scrapy.core.engine] INFO: Spider opened
2017-08-28 13:34:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-28 13:34:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www5.apply2jobs.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.core.scraper] ERROR: Error downloading <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.core.engine] INFO: Closing spider (finished)
gr3yh47
  • Did you change the user agent in your settings file as well? This is most probably the request being rejected by the server, i.e. scraping protection – Tarun Lalwani Aug 28 '17 at 19:09
  • @TarunLalwani I have done that as well. any other ideas? – gr3yh47 Aug 28 '17 at 20:10
  • how does `curl -v 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs'` behave? Are you able to connect? To dig deeper, you'll probably need to use a network sniffer like Wireshark (though this being an HTTPS connection, one may not see much more than the initial TLS handshake) – paul trmbrth Aug 29 '17 at 13:48
  • So I checked and it seems there is some issue with the website. If you do a `curl -v ""`, it ends with an exception: `* GnuTLS recv error (-110): The TLS connection was non-properly terminated. * Closing connection 0 curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly terminated.` The same thing works in a browser, but I see the site uses old ciphers and old TLS 1.0. I would suggest you open an issue with Scrapy for this particular URL and see if they have something. But this issue is specific to this site and has something to do with ciphers – Tarun Lalwani Aug 29 '17 at 15:11
  • @paultrmbrth, this works in browser but not in scrapy or curl. – Tarun Lalwani Aug 29 '17 at 15:12
  • I'm able to connect but with a few things tweaked: `scrapy shell "https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs" -s DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.0` (setting TLS 1.0) and using OpenSSL 1.0.2 (at least when using OpenSSL 1.1.0f, it failed for me). I'm not sure if it has to do with OpenSSL 1.1 or OpenSSL 1.1 + Twisted – paul trmbrth Aug 29 '17 at 16:03
  • It looks like [OpenSSL 1.1.0 removed RC4-MD5](https://www.openssl.org/news/changelog.html#x7) which is the cipher that the server negotiates (in my tests with `OpenSSL 1.0.2g 1 Mar 2016`). @gr3yh47, can you check what version of OpenSSL you are using? (it appears in `scrapy version -v`) – paul trmbrth Aug 29 '17 at 17:12
  • @paultrmbrth (and Tarun) wow, thank you for working hard on this. version is pyOpenSSL : 17.0.0 (OpenSSL 1.1.0e 16 Feb 2017) – gr3yh47 Aug 29 '17 at 17:46
  • @paultrmbrth any ideas what I can do besides downgrading? I found this but am not sure what to do with it to make Scrapy use it: https://code.launchpad.net/~njoyce512/pyopenssl/rc4 – gr3yh47 Aug 29 '17 at 19:09
  • See [my comment on GitHub](https://github.com/scrapy/scrapy/issues/2311#issuecomment-325804964). – paul trmbrth Aug 29 '17 at 21:20
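The cipher mismatch discussed in these comments can also be probed outside of Scrapy and curl; a hypothetical check with OpenSSL's `s_client` (flag availability depends on your OpenSSL build — old builds may not accept `-cipher 'RC4-MD5'` at all, which is itself the symptom):

```shell
# Try to negotiate TLS 1.0 with the legacy RC4-MD5 cipher explicitly.
# If this handshake succeeds while a default `openssl s_client -connect ...`
# fails, the server only offers ciphers your default build has disabled.
openssl s_client -connect www5.apply2jobs.com:443 -tls1 -cipher 'RC4-MD5' < /dev/null
```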

1 Answer


So thanks to the discussion on GitHub as well as in the comments on my question, it looks like the best course of action is to use a virtualenv with `cryptography<2`.
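For anyone else hitting this, a minimal sketch of that setup (assuming `virtualenv` is installed; the environment name is arbitrary and versions are as of 2017):

```shell
# create and activate an isolated environment so the downgraded
# crypto library does not affect other projects
virtualenv rc4env
. rc4env/bin/activate

# cryptography < 2.0 still links against OpenSSL 1.0.x, which retains
# the RC4-MD5 cipher this server negotiates
pip install 'cryptography<2' scrapy
```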

credit to @paultrmbrth for helping so much

I tried compiling OpenSSL 1.1.0f with `enable-weak-ssl-ciphers` to build a static wheel, but for some reason I didn't manage to get it to support TLS_RSA_WITH_RC4_128_MD5 (as ssllabs.com reports). I'm lacking OpenSSL building knowledge, apparently. So the only option I see is to use a virtualenv with `cryptography<2` for scraping that website.
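To see which OpenSSL a given interpreter was built against, the standard library exposes the version directly (a quick sketch; note that the stdlib `ssl` module can link a different OpenSSL than pyOpenSSL/cryptography do, so `scrapy version -v`, as suggested in the comments, remains the authoritative check for what Scrapy actually uses):

```python
import ssl

# version string of the OpenSSL the interpreter links against,
# e.g. "OpenSSL 1.1.0e  16 Feb 2017"
print(ssl.OPENSSL_VERSION)

# the same information as a comparable tuple:
# (major, minor, fix, patch, status)
print(ssl.OPENSSL_VERSION_INFO)
```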
