I'm trying to scrape product names and prices from this site, using the following code:

import scrapy
from scrapy_splash import SplashRequest

# Product is the project's own item class (its definition is not shown here).


class ProductSpider(scrapy.Spider):
    name = 'product'

    start_urls = ['https://www.bodyenfitshop.nl/']

    def parse(self, response):
        # follow links to different categories
        for href in response.css('ol.nav-primary > li.category-node > a::attr(href)'):
            # A category page currently does not list all the items belonging to that category.
            # Add "?p=1" to get the list view
            href = href.extract() + "?p=1"
            print(href)
            yield SplashRequest(href, self.parse_category, args={
                'wait': 0.5
            })

    def parse_category(self, response):
        for product in response.css('li.item'):
            new_name = product.css('div.product-name > a::text').extract_first().strip()
            # TODO: find better regex
            # Raw string so that \d is passed to the regex engine unchanged
            new_price = float(product.css('span.price::text').re(r'\d+,\d+')[0].replace(",", "."))
            new_url = product.css('div.product-name > a::attr(href)').extract_first().strip()
            yield Product(name=new_name, price=new_price, url=new_url)

I suspect the problem is in the href passed to SplashRequest, but when I printed the hrefs, they were all fully qualified URLs like these:

https://www.bodyenfitshop.nl/duursport/?p=1

https://www.bodyenfitshop.nl/workouts/?p=1

https://www.bodyenfitshop.nl/aminozuren/?p=1

All other questions on SO regarding this error (like this or this) were solved by adding "https" to the URLs. Mine already have it, so I'm at a loss as to what is causing the error.

This is one of the errors I get (it is repeated multiple times):

2017-05-31 22:51:07 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.bodyenfitshop.nl/afslanken/?p=1 via https://www.bodyenfitshop.nl/afslanken/?p=1>
Traceback (most recent call last):
  File "c:\program files (x86)\python\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "c:\program files (x86)\python\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "c:\program files (x86)\python\lib\site-packages\scrapy_splash\middleware.py", line 358, in process_request
    priority=request.priority + self.rescheduling_priority_adjust
  File "c:\program files (x86)\python\lib\site-packages\scrapy\http\request\__init__.py", line 94, in replace
    return cls(*args, **kwargs)
  File "c:\program files (x86)\python\lib\site-packages\scrapy_splash\request.py", line 76, in __init__
    **kwargs)
  File "c:\program files (x86)\python\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "c:\program files (x86)\python\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: render.html

The render.html is (I think) the default name for what Splash produces. I don't think I can alter that.

Any help or poke in the right direction is greatly appreciated!

HDW
  • What do you have as `SPLASH_URL` setting? – paul trmbrth Jun 01 '17 at 09:32
  • @paultrmbrth As I am running the splash server locally: SPLASH_URL = '127.0.0.1:8050'. There's no scheme supplied here, but other tutorials also just used an IP. Can't check now (not on my dev pc), but would using "http://localhost:8050" fix it? – HDW Jun 01 '17 at 12:06
  • Please use the full URL with scheme, i.e. `SPLASH_URL = 'http://127.0.0.1:8050'`. Without it, I can indeed reproduce your error. Which tutorial are you following that shows the URL without the scheme? [The official README](https://github.com/scrapy-plugins/scrapy-splash) uses the full URL. – paul trmbrth Jun 01 '17 at 12:47
  • @paultrmbrth I misread the (official) tutorial (as well as some other sites). They are correct. Adding the scheme to my SPLASH_URL fixed the problem. Was there a way I could have deduced this from the error messages I got? I'm new to this framework and still learning. Thanks for the help btw! – HDW Jun 01 '17 at 15:00
  • I've opened [an issue](https://github.com/scrapy-plugins/scrapy-splash/issues/120) to try and make the error more obvious. – paul trmbrth Jun 01 '17 at 16:34
  • @paultrmbrth Nice, thanks! – HDW Jun 01 '17 at 18:39

1 Answer

In my case, the cause was that I had left the scheme (https://) off the SPLASH_URL.

So it should be SPLASH_URL = 'https://yourdomain.com:8050' (or http://, depending on how your Splash server is served).
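
To catch this earlier, one option is a small guard in the settings module. This is only a sketch: `check_splash_url` is a hypothetical helper, not part of Scrapy or scrapy-splash, and the address is the local one from the question.

```python
from urllib.parse import urlparse


def check_splash_url(url):
    """Return url unchanged if it has an http(s) scheme and a host, else raise."""
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        raise ValueError('SPLASH_URL needs a scheme and host, got: %r' % url)
    return url


# Hypothetical use in settings.py: fails at import time if the scheme is missing.
SPLASH_URL = check_splash_url('http://127.0.0.1:8050')
```

Passing `'127.0.0.1:8050'` to the helper raises a `ValueError` that names the setting, which is easier to trace than the `render.html` message.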

Aminah Nuraini