Problem:

I am using scrapy-splash to scrape a web page. However, the CSS selector for imageURL does not return any element, while the ones for name and category work fine. (The XPath and CSS selectors were all copied directly from Chrome.)

Things I've Tried:

At first I thought the page had not fully loaded by the time parse was called, so I changed the wait argument of SplashRequest to 5, but it did not help. I also downloaded a copy of the HTML response from the Splash GUI (http://localhost:8050) and verified that the XPath and CSS selectors all work on the downloaded copy. I assumed this HTML would be exactly what Scrapy sees in parse, so I can't make sense of why the selector fails inside the Scrapy script.

Code:

Here is my code:

import scrapy
from scrapy_splash import SplashRequest

# ProductItem is the project's own Item class, defined elsewhere in the project.


class NikeSpider(scrapy.Spider):
    name = 'nike'
    allowed_domains = ['nike.com', 'store.nike.com']
    start_urls = ['https://www.nike.com/t/air-vapormax-flyknit-utility-running-shoe-XPTbVZzp/AH6834-400']

    def start_requests(self):
        # Render each page through Splash and wait 5 seconds for scripts to run.
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                args={
                    'wait': 5
                }
            )

    def parse(self, response):
        name = response.xpath('//*[@id="RightRail"]/div/div[1]/div[1]/h1/text()').extract_first()
        # This is the selector that returns None inside Scrapy.
        imageURL = response.css('#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3) > img::attr(src)').extract_first()
        category = response.css('#RightRail > div > div.d-lg-ib.mb0-sm.mb8-lg.u-full-width > div.ncss-base.pr12-sm > h2::text').extract_first()
        url = response.url

        # Only emit an item when every field was found.
        if name is not None and imageURL is not None and category is not None:
            item = ProductItem()
            item['name'] = name
            item['imageURL'] = imageURL
            item['category'] = category
            item['URL'] = url

            yield item
Tinyik

1 Answer

Maybe the site serves different markup to different clients, but for me this works (note source::attr(srcset) at the end):

imageURL = response.css('#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3) > source::attr(srcset)').extract_first()
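
Since the markup may contain either an img or a source element, one option is to try both. The following is only a minimal sketch of that idea, not code from the original spider: the helper name first_image_url and the sample markup are made up for illustration, and it uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs without any dependencies.

```python
import xml.etree.ElementTree as ET


def first_image_url(picture):
    """Return an image URL from a <picture> element, or None.

    Hypothetical helper: prefers <img src>, then falls back to the
    first candidate listed in <source srcset>.
    """
    img = picture.find(".//img")
    if img is not None and img.get("src"):
        return img.get("src")

    source = picture.find(".//source")
    if source is not None:
        # srcset can hold several comma-separated candidates such as
        # "a.jpg 1x, b.jpg 2x"; keep only the URL of the first one.
        first_candidate = source.get("srcset", "").split(",")[0].split()
        if first_candidate:
            return first_candidate[0]

    return None


# Made-up sample markup covering both variants.
with_img = ET.fromstring('<picture><img src="shoe.jpg"/></picture>')
with_source = ET.fromstring('<picture><source srcset="a.jpg 1x, b.jpg 2x"/></picture>')
empty = ET.fromstring('<picture/>')

print(first_image_url(with_img))
print(first_image_url(with_source))
print(first_image_url(empty))
```

In a Scrapy callback the same fallback could be written with two selector calls, trying img::attr(src) first and source::attr(srcset) second.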
gangabass
  • Thanks for answering! I don't think the CSS selector is the problem here: my selector does return the img element I need when I test it in the browser. Did I miss anything? – Tinyik May 27 '18 at 00:47
  • I have tested in browser and in Scrapy – gangabass May 27 '18 at 00:49
  • Why won't mine work in Scrapy? It works in the browser too. – Tinyik May 27 '18 at 00:50
  • The only reason I can see is some kind of random variation in the HTML formatting – gangabass May 27 '18 at 00:51
  • Sometimes it's `img` but sometimes `source` etc – gangabass May 27 '18 at 00:51
  • Try to save HTML from Scrapy (I mean Splash response HTML) and check – gangabass May 27 '18 at 00:52
  • You mean the response object in parse function? – Tinyik May 27 '18 at 00:54
  • It seems the problem is that the Scrapy parser added some tags automatically, which caused the selector to fail (i.e. it can't find the img element). The tags are not closed on the original website. Here is what running `#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3)` gives me, and you can clearly see there IS an img tag. – Tinyik May 27 '18 at 01:47
  • I believe it's a bug in the Splash parser. Tags are added at incorrect locations, which causes the selectors to fail. @gangabass – Tinyik May 27 '18 at 01:52
  • My original selector works after removing the `>` before img. The incorrectly inserted tags break the child combinator, because img is no longer a direct child of picture. – Tinyik May 27 '18 at 01:57
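
The effect described in the last comment can be reproduced in isolation. This is a small sketch under stated assumptions: the markup is invented (it imitates a parser that nested img inside an unclosed source tag, which may or may not be exactly what happened on the real page), and it uses the standard library's xml.etree.ElementTree with XPath rather than Scrapy selectors. The XPath step `picture/img` plays the role of the CSS child combinator `picture > img`, and `picture//img` the role of the descendant combinator `picture img`.

```python
import xml.etree.ElementTree as ET

# Invented markup: the img ends up inside <source> instead of being
# a direct child of <picture>, as if the parser had closed the
# <source> tag in the wrong place.
reparsed = ET.fromstring(
    '<div>'
    '<picture>'
    '<source srcset="shoe-large.jpg">'
    '<img src="shoe.jpg"/>'
    '</source>'
    '</picture>'
    '</div>'
)

# Child step (like CSS "picture > img"): only direct children match.
child_match = reparsed.find(".//picture/img")

# Descendant step (like CSS "picture img"): any depth matches.
descendant_match = reparsed.find(".//picture//img")

print("child combinator found img:", child_match is not None)
print("descendant combinator found img:", descendant_match is not None)
```

This is why dropping the `>` before img makes the selector tolerant of extra wrapper tags, at the cost of also matching img elements nested more deeply than intended.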