29

I've been stuck on this bug for a while; the error message is as follows:

File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url
            raise ValueError('Missing scheme in request url: %s' % self._url)
            exceptions.ValueError: Missing scheme in request url: h

Scrapy code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.http import Request
    from spyder.items import SypderItem

    import sys
    import MySQLdb
    import hashlib
    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher

    # _*_ coding: utf-8 _*_

    class some_Spyder(CrawlSpider):
        name = "spyder"

        def __init__(self, *a, **kw):
            # catch the spider stopping
            # dispatcher.connect(self.spider_closed, signals.spider_closed)
            # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)

            self.allowed_domains = "domainname.com"
            self.start_urls = "http://www.domainname.com/"
            self.xpaths = '''//td[@class="CatBg" and @width="25%" 
                          and @valign="top" and @align="center"]
                          /table[@cellspacing="0"]//tr/td/a/@href'''

            self.rules = (
                Rule(SgmlLinkExtractor(restrict_xpaths=(self.xpaths))),
                Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
                )

            super(some_Spyder, self).__init__(*a, **kw)

        def parse_items(self, response):
            sel = Selector(response)
            items = []
            listings = sel.xpath('//*[@id="tabContent"]/table/tr')

            item = SypderItem()
            item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

            items.append(item)
            return items

I'm pretty sure it's something to do with the URLs I'm asking Scrapy to follow in the LinkExtractor. When extracted in the shell they look something like this:

data=u'cart.php?target=category&category_id=826'

Compared to another URL extracted from a working spider:

data=u'/path/someotherpath/category.php?query=someval'

I've had a look at a few questions on Stack Overflow, such as Downloading pictures with scrapy, but from reading them I think I may have a slightly different problem.

I also took a look at this - http://static.scrapy.org/coverage-report/scrapy_http_request___init__.html

It explains that the error is raised if self._url is missing a ":", but from looking at the start_urls I've defined I can't quite see why this error would show, since the scheme is clearly defined.
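
For reference, the check on that page boils down to something like this (a paraphrased sketch, not Scrapy's verbatim code):

    # sketch of the validation in scrapy.http.Request._set_url (paraphrased)
    def _set_url(self, url):
        if ':' not in url:
            raise ValueError('Missing scheme in request url: %s' % url)
        self._url = url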

Toby

7 Answers

30

change start_urls to:

self.start_urls = ["http://www.bankofwow.com/"]
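
A quick way to see the difference in a Python shell (a hedged check; this matches the Scrapy 0.20 behaviour shown in the traceback):

    from scrapy.http import Request

    Request("http://www.bankofwow.com/")   # fine: the scheme is present
    Request("h")                           # ValueError: Missing scheme in request url: h

    # With start_urls as a plain string, Scrapy builds a Request from each
    # character in turn, so the very first one ("h") triggers the error.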
Guy Gavriely
  • 11,228
  • 6
  • 27
  • 42
  • Thanks for the reply! Do you mean like so: `self.xpath = 'http://www.bankofwow.com/' + '//td[@class="CatBg" and @width="25%" and @valign="top" and @align="center"]/table[@cellspacing="0"]//tr/td/a/@href'` I've tried this and I get the same error unfortunately – Toby Jan 13 '14 at 23:53
  • I do apologise, I had a bit of a brain fart and I said the domain was included in the working spider, this is not the case. – Toby Jan 14 '14 at 00:31
  • 1
    That did the trick, sorry for fudging up the question. Will accept now :) – Toby Jan 14 '14 at 08:41
  • it did not work for me. still same error. but this solved my problem http://stackoverflow.com/questions/27516339/scrapy-error-exceptions-valueerror-missing-scheme-in-request-url – ji-ruh Apr 30 '16 at 20:16
10

Prepend the URL with "http://" or "https://".
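
For the relative links the extractor finds (like cart.php?...), joining them against the page's base URL is what supplies the scheme and host; a minimal sketch with the standard library (Python 2, matching the question's environment):

    import urlparse  # Python 3: from urllib.parse import urljoin

    base = "http://www.bankofwow.com/"
    relative = "cart.php?target=category&category_id=826"
    print(urlparse.urljoin(base, relative))
    # http://www.bankofwow.com/cart.php?target=category&category_id=826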

Rich Tier
6

As @Guy answered earlier, the start_urls attribute must be a list. The exceptions.ValueError: Missing scheme in request url: h message comes from exactly that: the "h" in the error message is the first character of "http://www.bankofwow.com/", which is being interpreted as a list (of characters).

allowed_domains must also be a list of domains, otherwise requests will be filtered out as "offsite".
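
For instance (a short sketch; the domain follows this answer's example):

    allowed_domains = ["bankofwow.com"]          # a list of bare domain names
    start_urls = ["http://www.bankofwow.com/"]   # a list of full URLs, scheme included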

Change restrict_xpaths to

self.xpaths = """//td[@class="CatBg" and @width="25%" 
                    and @valign="top" and @align="center"]
                   /table[@cellspacing="0"]//tr/td"""

It should point to a region of the document in which to find links; it should not be the link URLs themselves.

From http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor

restrict_xpaths (str or list) – is a XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links.
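
To illustrate the difference (a hedged sketch using the SgmlLinkExtractor from Scrapy 0.20, as in the question; the shortened XPath is for readability only):

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    # Correct: the XPath selects the *region* (table cells) containing the <a> tags;
    # the extractor then collects the hrefs of the links inside that region.
    extractor = SgmlLinkExtractor(
        restrict_xpaths=('//td[@class="CatBg"]//table//tr/td',))

    # Wrong for restrict_xpaths: an XPath ending in /a/@href selects the href
    # *strings* themselves, not a region to scan for links.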

Finally, it's customary to define these as class attributes instead of setting them in __init__:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from bow.items import BowItem

import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

# _*_ coding: utf-8 _*_

class bankOfWow_spider(CrawlSpider):
    name = "bankofwow"

    allowed_domains = ["bankofwow.com"]
    start_urls = ["http://www.bankofwow.com/"]
    xpaths = '''//td[@class="CatBg" and @width="25%"
                  and @valign="top" and @align="center"]
                  /table[@cellspacing="0"]//tr/td'''

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=(xpaths,))),
        Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
        )

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        super(bankOfWow_spider, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')

        item = BowItem()
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

        items.append(item)
        return items
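
To run the spider, the standard Scrapy command line applies (assuming the project package is bow, as in the imports above):

    scrapy crawl bankofwow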
paul trmbrth
  • Thanks for the reply :). It's still throwing the same error though. Thanks for the quote from the documentation, I'll be sure to keep that in mind in future! Just in case people are wondering, I've tested the xpaths with Xpath Checker and it is listing the correct links :) – Toby Jan 14 '14 at 00:20
  • When I have some more reputation I'll +1 this because this was useful. Thanks again :) – Toby Jan 14 '14 at 08:41
  • I've just read your revised answer and would just like to thank you again! In fact I ran into another little hiccup, and one of your answers on SO helped me again, thanks for that also :). I'll make sure to make the amendments you suggested. Not sure why this answer was downvoted, could the person perhaps give their reasons? – Toby Jan 14 '14 at 23:45
3

A URI with a scheme basically has a syntax like:

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

Examples of popular schemes include http(s), ftp, mailto, file, data and irc. There are also ones we are somewhat familiar with, like about or about:blank.

It's made clearer by the description on that same definition page:

                    hierarchical part
        ┌───────────────────┴─────────────────────┐
                    authority               path
        ┌───────────────┴───────────────┐┌───┴────┐
  abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
  └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
scheme  user information     host     port                  query         fragment

  urn:example:mammal:monotreme:echidna
  └┬┘ └──────────────┬───────────────┘
scheme              path

In this question's Missing scheme error, it appears that the scheme: and [//[user:password@]host[:port]] parts are missing in

data=u'cart.php?target=category&category_id=826'

as mentioned above.
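
A hedged sketch with the standard library makes the decomposition concrete (using http in place of the diagram's abc scheme so that urlparse splits every component):

    from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

    parts = urlparse("http://username:password@example.com:123/path/data?key=value#fragid1")
    print(parts.scheme)    # http
    print(parts.netloc)    # username:password@example.com:123
    print(parts.path)      # /path/data
    print(parts.query)     # key=value
    print(parts.fragment)  # fragid1

    # The failing URL from the question has no scheme at all:
    print(urlparse("cart.php?target=category&category_id=826").scheme)  # '' (empty)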

I had a similar problem, where this simple concept was enough to lead me to the solution!

Hope this helps some.

Snail-Horn
1

change start_urls to:

self.start_urls = ("http://www.domainname.com/",)

It should work.
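
Either a tuple or a list works, since Scrapy only iterates over start_urls; the important part is that each element is one complete URL (a hedged note):

    # Both are fine: each element is one complete URL string.
    start_urls = ("http://www.domainname.com/",)   # tuple
    start_urls = ["http://www.domainname.com/"]    # list

    # The bug in the question: a bare string is also iterable, but iterating
    # it yields single characters ('h', 't', 't', 'p', ...).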

Crypto营长
0

    yield{"Text": text,
    ^
    IndentationError: unindent does not match any outer indentation level

When this error comes up while using the Sublime editor, the code is mixing tabs and spaces, which is difficult to spot. An easy solution is to copy the full code into an ordinary text document: there you can easily see the inconsistent indentation under the for loop and the statements that follow, so you can correct it in Notepad, copy it back into Sublime, and the code will run.
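
Alternatively, Python 2 itself can flag the mixed indentation (a hedged tip, independent of the editor; myspider.py is a stand-in for your script's name):

    python -tt myspider.py   # -tt turns inconsistent tab/space indentation into an error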

nbk
-1

The error is because start_urls is in a tuple: start_urls = ('http://quotes.toscrape.com/',)

Change start_urls to a list: start_urls = ['http://quotes.toscrape.com/']

  • 1
    This is the same solution as the accepted answer, but with a wrong analysis (`start_urls` in the question is a string, not a tuple) – snakecharmerb Nov 07 '19 at 15:28