I'm trying to extract data from a large number of sites that don't have valid SSL certificates. I'm using the boilerpipe Python wrapper to extract the text without the HTML and write it to a text file.
I know how to disable SSL certificate verification in the requests library, but I can't find an equivalent option for boilerpipe. Boilerpipe is an amazing Java library for preparing scraped data for NLP, so I'd love to be able to use it from Python.
Here's the code I'm running:
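from boilerpipe.extract import Extractor

for url in urls:
    # Passing url= makes boilerpipe download the page itself
    extractor = Extractor(url='http://www.' + url)
    extracted_text = extractor.getText()
    # Append each page's text to a single output file
    with open('websitestext.txt', 'a') as webtextfile:
        webtextfile.write(extracted_text)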
And here's the error I get, which I think is caused by SSL certificate verification:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>
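Since the traceback comes from urllib rather than requests, the urllib-level counterpart of skipping verification would presumably look something like the sketch below. It uses only the standard library; `make_unverified_context` and `fetch_html` are hypothetical helper names, not part of boilerpipe, and this is only a rough idea of the download step, not a confirmed fix.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname has to be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_html(url, timeout=10):
    """Download a page without verifying its certificate (hypothetical helper)."""
    context = make_unverified_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

If the wrapper accepts already-downloaded markup (I'm assuming an `html=` keyword on `Extractor`, which I haven't verified against my installed version), the result of `fetch_html` could be handed to it instead of a URL. Skipping verification like this is only reasonable for scraping public text, since it removes protection against tampered responses.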