Ignore SSL verification for boilerpipe python wrapper web extractor?

Question

I'm attempting to extract data from numerous sites that don't have SSL certifications. I'm using the boilerpipe python wrapper to extract the text without HTML and write it to a text file.

I understand how to remove the SSL certification requirement in the requests library, but I can't seem to find a solution when it comes to boilerpipe. Boilerpipe is an amazing Java library for preparing scraped data for NLP so I'd love to be able to use it in Python.

Here's the code I'm running:

for url in urls:
    extractor = Extractor(url='http://www.' + url)
    extracted_text = extractor.getText()
    with open('websitestext.txt', 'a') as webtextfile:
        webtextfile.write(extracted_text)

And here's the error I think is causing the problems (the SSL certification):

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>

rainbowunicorn · Accepted Answer · 2018-03-19T19:41:30.533

It seems I found a solution with this:

import ssl

try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        # Legacy Python that doesn't verify HTTPS certificates by default
        pass
    else:
        # Handle target environment that doesn't support HTTPS verification
        ssl._create_default_https_context = _create_unverified_https_context

And by adding an exception:

for url in urls:
    try:
        extractor = Extractor(url='http://www.' + url)
        extracted_text = extractor.getText()
    except:
        pass
    with open('websitestext.txt', 'a') as webtextfile:
        webtextfile.write(extracted_text)

Ignore SSL verification for boilerpipe python wrapper web extractor?

1 Answers1