I'm trying to scrape product names and prices from this site, using the following code:
import scrapy
from scrapy_splash import SplashRequest

# Product is the project's own Item class (defined elsewhere in the project).


class ProductSpider(scrapy.Spider):
    name = 'product'
    start_urls = ['https://www.bodyenfitshop.nl/']

    def parse(self, response):
        # follow links to different categories
        for href in response.css('ol.nav-primary > li.category-node > a::attr(href)'):
            # A category page currently does not list all the items belonging to that category.
            # Add "?p=1" to get the list view
            href = href.extract() + "?p=1"
            print(href)
            yield SplashRequest(href, self.parse_category, args={
                'wait': 0.5
            })

    def parse_category(self, response):
        for product in response.css('li.item'):
            new_name = product.css('div.product-name > a::text').extract_first().strip()
            # TODO: find better regex
            # Raw string so that \d is not treated as a string escape.
            new_price = float(product.css('span.price::text').re(r'\d+,\d+')[0].replace(",", "."))
            new_url = product.css('div.product-name > a::attr(href)').extract_first().strip()
            yield Product(name=new_name, price=new_price, url=new_url)
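Regarding the regex TODO, here is a minimal sketch of a more forgiving price parser. It assumes the prices use Dutch formatting (a dot as thousands separator and a comma as decimal separator); parse_price and PRICE_RE are hypothetical names for illustration and are not part of the spider.

```python
import re

# Digits, optional ".ddd" thousands groups, optional ",dd" decimals.
# Assumption: Dutch-style number formatting, e.g. "1.234,56".
PRICE_RE = re.compile(r'\d+(?:\.\d{3})*(?:,\d+)?')


def parse_price(text):
    """Return the first price found in text as a float, or None if there is none."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators, then switch the decimal comma to a dot.
    return float(match.group().replace('.', '').replace(',', '.'))


if __name__ == '__main__':
    for sample in ('€ 12,95', '€ 1.234,56', '19,-', 'Gratis'):
        print(repr(sample), '->', parse_price(sample))
```

Returning None instead of raising means a product without a recognisable price can be skipped or logged rather than stopping the whole category page.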
I suspect the problem is in the href that is passed to the SplashRequest, but when I printed them out they were all fully qualified URLs like this one:
https://www.bodyenfitshop.nl/duursport/?p=1
All other questions on SO regarding this error (like this or this) were solved by adding "https" to the URLs. Mine already have it, so I'm at a loss as to what is causing the error.
This is one of the errors that I get (repeated multiple times):
2017-05-31 22:51:07 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.bodyenfitshop.nl/afslanken/?p=1 via https://www.bodyenfitshop.nl/afslanken/?p=1>
Traceback (most recent call last):
  File "c:\program files (x86)\python\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "c:\program files (x86)\python\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "c:\program files (x86)\python\lib\site-packages\scrapy_splash\middleware.py", line 358, in process_request
    priority=request.priority + self.rescheduling_priority_adjust
  File "c:\program files (x86)\python\lib\site-packages\scrapy\http\request\__init__.py", line 94, in replace
    return cls(*args, **kwargs)
  File "c:\program files (x86)\python\lib\site-packages\scrapy_splash\request.py", line 76, in __init__
    **kwargs)
  File "c:\program files (x86)\python\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "c:\program files (x86)\python\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: render.html
As far as I can tell, render.html is the default Splash endpoint that scrapy-splash requests, not something from my own code, so I don't think I can alter it.
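One thing that might be worth checking, on the assumption that the final request URL is built by joining the SPLASH_URL setting with the endpoint name: a base URL without a scheme would leave exactly the bare render.html seen in the error. The sketch below uses only the standard library; build_render_url and has_http_scheme are hypothetical helpers for illustration and are not taken from scrapy-splash.

```python
from urllib.parse import urljoin, urlparse


def build_render_url(splash_url, endpoint="render.html"):
    """Hypothetical helper: join a Splash base URL with an endpoint name.

    Assumption: this mirrors how the request URL is assembled; it is not
    code copied from scrapy-splash.
    """
    return urljoin(splash_url, endpoint)


def has_http_scheme(url):
    """Return True if the URL has an http or https scheme."""
    return urlparse(url).scheme in ("http", "https")


if __name__ == "__main__":
    # A base with a scheme, a base without one, and an empty base.
    for base in ("http://localhost:8050", "localhost:8050", ""):
        joined = build_render_url(base)
        print(repr(base), "->", repr(joined), "| scheme ok:", has_http_scheme(joined))
```

If SPLASH_URL in settings.py looks like the second or third case, putting http:// in front of it may be enough to make the error go away.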
Any help or poke in the right direction is greatly appreciated!