I am trying to pull specific URLs from a webpage based on a CSS attribute. I can pull the first one, but I am having difficulty prepending the domain to get the full URL, and getting more than one URL.
I have tried joinurl and parse and ran into many issues. I keep getting global errors with joinurl.
Is there a simpler way of doing this?
I am using CentOS 6.5 and Python 2.7.5.
The code below returns the first URL, but without the http://www... prefix inline:
import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"  # Name of the Spider, required value
    start_urls = ["http://www.pdga.com/videos/"]

    # Entry point for the spiders
    def parse(self, response):
        SET_SELECTOR = 'tbody'
        for brickset in response.css(SET_SELECTOR):
            HTML_SELECTOR = 'td.views-field.views-field-title a ::attr(href)'
            yield {
                'http://www.pdga.com': brickset.css(HTML_SELECTOR).extract()[0]
            }
Current Output
http://www.pdga.com
/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman
Expected Output
The full list of URLs, each as one complete URL without any breaks
I do not have enough reputation points to post a couple of examples.
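For reference, the behaviour described under "Expected Output" can be sketched with the standard library alone. This is only a minimal sketch, not a tested spider: the helper name `absolute_urls` and the second href in the sample list are made up for illustration, and the import fallback is there so the same snippet runs on Python 2.7 as well as Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

BASE_URL = "http://www.pdga.com/videos/"


def absolute_urls(base_url, hrefs):
    """Resolve every (possibly relative) href against base_url, keeping order.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    return [urljoin(base_url, href) for href in hrefs]


# The first href is the one from the current output above;
# the second is an invented placeholder to show that a list is handled.
hrefs = [
    "/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman",
    "/videos/some-other-video",
]

for url in absolute_urls(BASE_URL, hrefs):
    print(url)
```

Inside the spider, the same idea would presumably mean dropping the `[0]` so that `.extract()` returns every matching href as a list, and calling `response.urljoin(href)` on each one, which resolves the href against the URL of the response. Yielding one item per href under a fixed key such as `'url'` (an arbitrary name) would also avoid using the domain as the dictionary key.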