
I'm currently writing a scraper with Scrapy. For some websites it works just fine, but for others I get the error:

Error reading file '': failed to load external entity ""

Here is the code I wrote for my scraper. Don't blame me, I'm still a beginner in Python.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from bs4 import BeautifulSoup
import lxml
from lxml.html.clean import Cleaner
#from scrapy.exporters import XmlItemExporter
import re

cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.remove_tags = ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'figure', 'small', 'blockquote', 'sub', 'em', 'hr', '!--..--', 'span', 'aside', 'a', 'svg', 'ul', 'li', 'img', 'source', 'nav', 'article', 'section', 'label', 'br', 'noscript', 'body', 'time', 'b', 'i', 'sup', 'strong', 'div']
cleaner.kill_tags = ['header', 'footer']

class MySpider(CrawlSpider):
    name = 'eship5'
    allowed_domains = [
    'ineratec.de',
    ]

    start_urls = [
    'http://ineratec.de/',
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] # Follow any link scrapy finds (that is allowed).


    def parse_item(self, response):
        page = response.url.replace("/"," ").replace(":"," ")
        filename = '%s.txt' %page
        body = response.url
        clean_text = lxml.html.tostring(cleaner.clean_html(lxml.html.parse(body)))
        #clean_text = re.sub( '\s+', ' ', str(clean_text, "utf-8").replace('<div>', '').replace('</div>', '')).strip()
        with open(filename, 'w') as f:
            f.write(clean_text)

When I run the code with Scrapy, the error occurs only on certain websites. Does it have anything to do with the ' ' and the " "? Thankful for any help.
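For context on the error message: `lxml.html.parse()` treats a plain string argument as a filename or URL and tries to fetch it with libxml2's own loader, which frequently cannot handle HTTPS; that is what produces "failed to load external entity". A minimal sketch of the difference (the example HTML is made up):

```python
import lxml.html

# parse() interprets a string as a filename/URL and downloads it itself;
# libxml2's loader often fails on HTTPS URLs, raising
# "OSError: Error reading file '...': failed to load external entity".
# lxml.html.parse("https://smight.com/en/")  # may raise OSError

# document_fromstring() parses markup that is already in memory,
# so no second download happens at all.
doc = lxml.html.document_fromstring("<html><body><p>hello</p></body></html>")
print(doc.text_content())  # -> hello
```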

EDIT1: This is the whole error:

2018-06-28 14:01:18 [scrapy.core.scraper] ERROR: Spider error processing <GET https://smight.com/en/> (referer: https://smight.com/)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spiders/crawl.py", line 76, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/gnus/Desktop/scraper/scraper/spiders/scraper.py", line 33, in parse_item
    clean_text = lxml.html.tostring(cleaner.clean_html(lxml.html.parse(body)))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 637, in lxml.etree._raiseParseError
OSError: Error reading file 'https://smight.com/en/': failed to load external entity "https://smight.com/en/"
  • Add the complete exception stacktrace – Tarun Lalwani Jun 28 '18 at 11:10
  • Hi Tarun, thanks for the quick answer. But what do you mean by add the complete exception stacktrace? Sorry for asking, but I'm completely new to scrapy and python... – Magnus Vivadeepa Jun 28 '18 at 11:34
  • With `Error reading file '': failed to load external entity ""` you must be getting a stack trace of the error: the file and line where the error was generated and propagated. We need that to understand the error – Tarun Lalwani Jun 28 '18 at 11:47
  • Hi Tarun, i edited my post. Hope this helps to solve my problem. Thanks in advance!! – Magnus Vivadeepa Jun 28 '18 at 12:03
  • See if this helps https://stackoverflow.com/questions/10457564/error-failed-to-load-external-entity-when-using-python-lxml – Tarun Lalwani Jun 28 '18 at 12:33
  • I saw this post earlier, but didn't quite get what they were saying. I changed `lxml.html.parse(body)` to `lxml.html.parse(StringIO(body))`. Then I got the error `write() argument must be str, not bytes`, so I changed `w` to `wb` in the second-to-last line. Now I get an output, but nothing more than
    linkname
    – Magnus Vivadeepa Jun 28 '18 at 14:07
  • Okie will later run the code and check – Tarun Lalwani Jun 28 '18 at 14:08
  • Thanks a lot Tarun!! I think the problem is that I only return the URL in `response.url` and not the HTML file, or rather the body... – Magnus Vivadeepa Jun 28 '18 at 14:16
  • Yeah, that should be `response.body` or `response.content`, whichever works – Tarun Lalwani Jun 28 '18 at 14:18
  • When I change `response.url` to `response.content` I get the error `AttributeError: 'HtmlResponse' object has no attribute 'content'`, and with `response.body` I get the error `TypeError: initial_value must be str or None, not bytes` – Magnus Vivadeepa Jun 28 '18 at 14:29
  • I put a `print(body)` call after `body = response.body` and in the terminal I get the body. So the error has to be in the `clean_text` line, because when I add `print(clean_text)` I get the error – Magnus Vivadeepa Jun 28 '18 at 14:39
  • I removed the `clean_text` line and tried to get an output. The good thing is, now I get an output, but the bad thing is that it is the whole HTML file including the header, which I don't need. – Magnus Vivadeepa Jun 28 '18 at 14:45
  • Is there an easy way to clean the HTML, without the `Cleaner` I imported? I just want the text that is shown on each website, something like htmltidy.net – Magnus Vivadeepa Jun 28 '18 at 15:03
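Putting the comment thread together: parse the already-downloaded bytes (`response.body`) instead of the URL string, and if the `Cleaner` import is more trouble than it's worth, the visible text can be extracted with plain lxml. A rough sketch, not tested against the sites in question; which tags to drop is an assumption and will vary per site:

```python
import lxml.html

def extract_text(body_bytes):
    """Return only the visible text of a page, given the raw response bytes."""
    doc = lxml.html.document_fromstring(body_bytes)
    # Remove elements whose text should never appear in the output;
    # this tag list is a judgment call, adjust it per site.
    for element in doc.xpath('//script | //style | //header | //footer | //nav'):
        element.drop_tree()
    # text_content() concatenates the remaining text nodes, tag-free;
    # split/join collapses runs of whitespace left behind by the markup.
    return ' '.join(doc.text_content().split())
```

Inside `parse_item` this would be called as `clean_text = extract_text(response.body)`, and the file can then stay opened in text mode (`'w'`), since the result is a `str`.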

0 Answers