
I am trying to use urlparse.urljoin within a Scrapy spider to compile a list of URLs to scrape. Currently my spider returns nothing, but does not throw any errors, so I am trying to check that I am compiling the URLs correctly.

My attempt was to test this in IDLE using str.join, as below:

>>> href = ['lphs.asp?id=598&city=london',
 'lphs.asp?id=480&city=london',
 'lphs.asp?id=1808&city=london',
 'lphs.asp?id=1662&city=london',
 'lphs.asp?id=502&city=london',]
>>> for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, x)
    print(final_url)

One line of what that returns:

lhttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/hhttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/.http:/www.url-base.com/destination/ahttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/?http:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/5http:/www.url-base.com/destination/9http:/www.url-base.com/destination/8http:/www.url-base.com/destination/&http:/www.url-base.com/destination/chttp:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/thttp:/www.url-base.com/destination/yhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/lhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/nhttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/n

I think it is obvious from my example that str.join does not behave in the same way - if it did, that would explain why my spider is not following these links! - however, it would be good to have confirmation of that.

If this is not the right way to test, how can I test this process?

Update - attempt using urlparse.urljoin below:

>>> from urllib.parse import urlparse
>>> for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = urlparse.urljoin(base, x)
    print(final_url)

Which is throwing AttributeError: 'function' object has no attribute 'urljoin'

Update - the spider function in question

def parse_links(self, response): 
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract() # insert xpath which contains the href for the rooms 
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
        # This is not joining the final_url right
        yield Request(final_url, callback=parse_links)

Update

I just tested again in IDLE:

>>> from urllib.parse import urljoin
>>> from urllib import parse
>>> room_links = ['lphs.asp?id=562&city=london',
 'lphs.asp?id=1706&city=london',
 'lphs.asp?id=1826&city=london',
 'lphs.asp?id=541&city=london',
 'lphs.asp?id=1672&city=london',
 'lphs.asp?id=509&city=london',
 'lphs.asp?id=428&city=london',
 'lphs.asp?id=614&city=london',
 'lphs.asp?id=336&city=london',
 'lphs.asp?id=412&city=london',
 'lphs.asp?id=611&city=london',]
>>> for link in room_links:
    base_url = "http:/www.url-base.com/destination/"
    final_url = urlparse.urljoin(base_url, link)
    print(final_url)

Which threw this:

Traceback (most recent call last):
  File "<pyshell#34>", line 3, in <module>
    final_url = urlparse.urljoin(base_url, link)
AttributeError: 'function' object has no attribute 'urljoin'
  • If your `room_links` is showing okay things and `base_url` is set correctly, then that should be fine... How about the rest of your spider... Is `parse_links` being called correctly and does it really need to yield a callback with itself? If anything - if it starts crawling, it looks like it'll just keep crawling and yield no data anyway. Do you have a `start_requests` or `start_urls` defined for instance? – Jon Clements Oct 18 '17 at 14:07
  • @JonClements The base url is set correctly; if I take it and append the relative href manually it works. I'm using `start_urls` rather than `start_requests`. But I don't think that the function is working correctly - see the update for what happens when I run it in IDLE. – Maverick Oct 19 '17 at 10:01

1 Answer


You see the output given because of this:

for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, href)   # <-- 'x' instead of 'href' probably intended here
    print(final_url)
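
That call is equivalent to base.join(x): str.join treats its first argument as the separator and inserts it between every element (here, every character) of the second argument. A quick check:

>>> str.join("-", "abc")   # same as "-".join("abc")
'a-b-c'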

urljoin from the urllib library behaves differently - see the documentation. It is not simple string concatenation; it resolves the second argument as a relative reference against the base URL.
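
For example (illustrative URLs; note this uses a properly formed http:// base, while your test base has only one slash after http:):

>>> from urllib.parse import urljoin
>>> base = "http://www.url-base.com/destination/"
>>> urljoin(base, "lphs.asp?id=598&city=london")
'http://www.url-base.com/destination/lphs.asp?id=598&city=london'
>>> urljoin(base, "/lphs.asp?id=598&city=london")   # an absolute path replaces the base path
'http://www.url-base.com/lphs.asp?id=598&city=london'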

EDIT: Based on your comment, I suppose you are using Python 3. With that import statement, you import the urlparse function, not the module - that's why you get that error. Either import the function and use it directly:

from urllib.parse import urljoin
...
final_url = urljoin(base, x)

or import the parse module and use the function like this:

from urllib import parse
...
final_url = parse.urljoin(base, x)
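
Applied to the spider method from the question, a minimal sketch of the fix would be (note that inside a class the callback needs to be self.parse_links, and Request must be imported from scrapy):

from urllib.parse import urljoin
from scrapy import Request
...
final_url = urljoin(base_url, link)
yield Request(final_url, callback=self.parse_links)
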
  • Yes, you are right, I meant 'x' there - I have updated the question and output. Thanks for confirming my suspicion. How can I test `urlparse.urljoin`? When I try to run this in IDLE I get `AttributeError: 'function' object has no attribute 'urljoin'` - I'll add it to the question. – Maverick Oct 18 '17 at 07:36
  • What does your `import` look like and what Python version are you using? There was a change between Python 2 and 3 regarding the `urllib` library. – Tomáš Linhart Oct 18 '17 at 07:48
  • It is `from urllib.parse import urlparse` – Maverick Oct 18 '17 at 07:50
  • Why are you even using `str.join` in that form and in this case? What's wrong with `final_url = base + x` ? Also note that in the case of `response` objects, you can use the `.follow` method when yielding where to go next and it'll automatically resolve the full url for you. – Jon Clements Oct 18 '17 at 07:59
  • Thank you @TomášLinhart, that works in IDLE, yet my spider is not crawling links or even printing them to the terminal - do you have any suggestions on that one? – Maverick Oct 18 '17 at 12:35
  • Yes @JonClements, realising that now. Is [this](https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow) what you are referring to? I'm not sure how I can implement that in my spider. – Maverick Oct 18 '17 at 12:37
  • @Maverick if the issue is with the spider not crawling anything, you should really show the actual spider code to give the context of what you're trying to do - there might be other issues and you're on a wild goose chase. – Jon Clements Oct 18 '17 at 12:39
  • @JonClements I agree, this was my [original problem](https://stackoverflow.com/questions/46724482/scrapy-notimplementederror/46724915). I've implemented the suggestions and am fairly confident the spider itself is set up correctly. So what I'm now trying to do is make sure each part works independently, in an attempt to figure out why it's not running joined up. – Maverick Oct 18 '17 at 12:55
  • @Maverick umm... and do you have `def start_requests(self): yield from (scrapy.Request(urljoin(self.base, href)) for href in self.hrefs)` present? Obviously make `self.base` what the base is and `self.hrefs` your list of things? – Jon Clements Oct 18 '17 at 13:02
  • @JonClements, hmm not 100% sure what you mean here. I've put my function in the question. Would I not pass my function `(self, response)`? – Maverick Oct 18 '17 at 13:42
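
Pulling the comment thread together, a minimal sketch of the kind of spider Jon Clements is describing (the spider name, URLs, XPath, and method names here are illustrative; response.follow needs Scrapy 1.4+):

import scrapy


class RoomsSpider(scrapy.Spider):
    name = 'rooms'
    start_urls = ['http://www.example.com/followthrough']

    def parse(self, response):
        # requests generated from start_urls land here by default
        room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract()
        for link in room_links:
            # response.follow resolves the relative href against response.url
            yield response.follow(link, callback=self.parse_room)

    def parse_room(self, response):
        # parse the individual room page and yield items here
        pass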