
I need help converting relative URLs to absolute URLs in a Scrapy spider.

I need to convert the links on my start pages to absolute URLs so I can get the images of the crawled items, which are on the start pages. I have unsuccessfully tried different ways to achieve this and I'm stuck. Any suggestions?

import scrapy
from scrapy.http import Request
# ExampleItem is defined in my project's items.py


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1"
    ]

    def parse(self, response):
        image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
        relative_url = response.xpath('//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href').extract()

        # absolute_urls is what I am trying to build from relative_url
        for image_url, url in zip(image_urls, absolute_urls):
            item = ExampleItem()
            item['image_urls'] = [image_url]

            request = Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request
    `response.urljoin(relative_url)` will do the trick; it's a wrapper around the urljoin method from urlparse, but without having to import the urlparse library. Very handy. – Steve Mar 18 '16 at 13:50

1 Answer


There are three main ways to achieve that:

  1. Using the urljoin function from urllib.parse:

    from urllib.parse import urljoin
    # Same as: from w3lib.url import urljoin
    
    url = urljoin(base_url, relative_url)
    
  2. Using the response's urljoin wrapper method, as mentioned by Steve.

    url = response.urljoin(relative_url)
    
  3. If you also want to yield a request from that link, you can use the handy response.follow method:

    # It will create a new request using the above "urljoin" method
    yield response.follow(relative_url, callback=self.parse)
    
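Applied to the spider from the question, the second option is a minimal change: join each relative link against the response URL before building the request. Below is a rough sketch of how the parse method could look; ExampleItem, parse_dir_contents, and the XPath expressions are taken from the question and are assumed to exist in your project.

    # Sketch only: joins each relative link with response.urljoin (option 2 above)
    # before creating the request. Assumes the spider, ExampleItem and
    # parse_dir_contents from the question.
    def parse(self, response):
        image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
        relative_urls = response.xpath('//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href').extract()

        for image_url, relative_url in zip(image_urls, relative_urls):
            item = ExampleItem()
            item['image_urls'] = [image_url]

            absolute_url = response.urljoin(relative_url)
            request = scrapy.Request(absolute_url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request

The same loop could instead use `response.follow(relative_url, callback=self.parse_dir_contents, meta={'item': item})`, which performs the urljoin and the request creation in a single call.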