
I am trying to pull specific URLs from a webpage based on a CSS attribute. I can pull the first one, but I am having difficulty building the full URL, or getting more than one URL.

I have tried urljoin and urlparse and run into many issues. I keep getting global name errors with urljoin.

Is there a simpler way of doing this?


I am using CentOS 6.5 & Python 2.7.5

The code below returns the first URL, but without the http://www... prefix inline.

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"  # Name of the Spider, required value

    start_urls = ["http://www.pdga.com/videos/"]

    # Entry point for the spider
    def parse(self, response):
        SET_SELECTOR = 'tbody'
        for brickset in response.css(SET_SELECTOR):

            HTML_SELECTOR = 'td.views-field.views-field-title a ::attr(href)'
            yield {
                'http://www.pdga.com': brickset.css(HTML_SELECTOR).extract()[0]
            }

Current Output

http://www.pdga.com
/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman

Expected Output

A full list of URLs without any breaks

I do not have enough reputation points to post a couple of examples.

Thomas

2 Answers


Your code returns a dictionary, which is why the output is split:

{'http://www.pdga.com': u'/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

What you could do is yield the dictionary like this:

yield {
    'href_link': 'http://www.pdga.com' + brickset.css(HTML_SELECTOR).extract()[0]
}

This will give you a new dict whose value is the unbroken href:

{'href_link': u'http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

Note: a Spider must return Request, BaseItem, dict or None; refer to the parse function documentation.

Tiny.D

In order to get absolute URLs from relative links, you can use Scrapy's response.urljoin() method and rewrite your code like this:

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath('//a[contains(., "next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        yield {
            'you_tube_link': 'http:' + link.split('?')[0]
        }

# To save links in csv format, run in the console: scrapy crawl pdgavideos -o links.csv
# http://www.youtube.com/embed/tYBF-BaqVJ8
# http://www.youtube.com/embed/_H0hBBc1Azg
# http://www.youtube.com/embed/HRbKFRCqCos
# http://www.youtube.com/embed/yz3D1sXQkKk
# http://www.youtube.com/embed/W7kuKe2aQ_c
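For reference, response.urljoin() delegates to the standard library's urljoin, so its behaviour can be checked outside Scrapy. A tiny sketch (the example path is made up; on Python 2.7 the function lives in urlparse, on Python 3 in urllib.parse):

```python
# response.urljoin() wraps the stdlib urljoin.
try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

base = 'http://www.pdga.com/videos/'
# A root-relative href replaces the path component of the base URL:
print(urljoin(base, '/videos/some-video'))
# http://www.pdga.com/videos/some-video
```

This is why joining with response.urljoin(link) is safer than string concatenation: it also handles hrefs that are already absolute.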
vold
  • Thank you both Tiny.D and vold for your quick response! This is exactly what I was looking to achieve. vold: am I able to output the data without the word link or anything else displayed before the results? – Thomas May 06 '17 at 09:14
  • You are welcome. As @Tiny.D already pointed out: Scrapy must return either new Request or Item or dictionary. If you want to simply output string with url in the console you better use `requests` with `bs4` or `lxml` parsers. – vold May 06 '17 at 10:17
  • @Thomas I edited my answer to provide more desired output. – vold May 06 '17 at 12:28
  • Thank you soooooo much vold. If I could give you more points, then I would =D – Thomas May 08 '17 at 05:15
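As a footnote to the comment above suggesting `requests` with `bs4` or `lxml`: a minimal, dependency-free sketch of the same idea using only the standard library. The `/videos/` href filter stands in for the table selector and is an assumption, as is the page encoding:

```python
# Collect absolute /videos/ links without Scrapy, stdlib only.
try:                                    # Python 2 imports
    from HTMLParser import HTMLParser
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:                     # Python 3 imports
    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

BASE = 'http://www.pdga.com/videos/'

class LinkCollector(HTMLParser):
    """Collect href attributes of <a> tags pointing at /videos/ pages."""
    def __init__(self):
        HTMLParser.__init__(self)       # explicit call works on both versions
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.startswith('/videos/'):
                    self.links.append(urljoin(BASE, value))

def collect_links(page_html):
    parser = LinkCollector()
    parser.feed(page_html)
    return parser.links

# Live usage (network access assumed):
# print('\n'.join(collect_links(urlopen(BASE).read().decode('utf-8'))))
```

This prints plain URL strings directly, which answers the "without the word link" question, at the cost of losing Scrapy's crawling and export machinery.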