0

I am using Scrapy to scrape the product image src links from this site:

http://eshop.tesco.com.my/en-GB/Promotion/List?SortBy=Default

For some reason, the XPath doesn't grab the product image src links. I tried to extract all the image src links from the site by testing this XPath in the Scrapy shell:

response.xpath('//img').extract()

The returned result shows that the img tags for the products have no src attribute:

 [u'<img alt="Grocery Home" class="tLogoMain" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/tLogoMain.gif" title="Grocery Home">',
 u'<img src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/searchFor.png" alt="Search" class="searchFor">',
 u'<img alt="Previous" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/pg-prev-disbl-btn.png">',
 u'<img alt="Next" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/pg-nxt-btn.png">',
 u'<img alt="Grid view" class="grdView" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/high-grd-view.png">',
 u'<img alt="List view" class="lstView" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/unhigh-lst-view.png">',
 u'<img alt="" id="productImg-7072093609">',
 u'<img alt="" id="productImg-7070005656">',
 u'<img alt="" id="productImg-7070005648">',
 u'<img alt="" id="productImg-7000034983">',
 u'<img alt="" id="productImg-7070483892">',
 u'<img alt="" id="productImg-7000035009">',
 u'<img alt="" id="productImg-7000801798">',
 u'<img alt="" id="productImg-7072123710">',
 u'<img alt="" id="productImg-7072123737">',
 u'<img alt="" id="productImg-7072123702">',
 u'<img alt="" id="productImg-7004102002">',
 u'<img alt="" id="productImg-7001314416">',
 u'<img alt="" id="productImg-7001829106">',
 u'<img alt="" id="productImg-7001495593">',
 u'<img alt="" id="productImg-7001812165">',
 u'<img alt="" id="productImg-7001813226">',
 u'<img alt="" id="productImg-7002760339">',
 u'<img alt="" id="productImg-7001812157">',
 u'<img alt="" id="productImg-7002800969">',
 u'<img alt="" id="productImg-7002764067">',
 u'<img alt="" id="productImg-7001866206">',
 u'<img alt="" id="productImg-7070980683">',
 u'<img alt="" id="productImg-7072086912">',
 u'<img alt="" id="productImg-7001884344">',
 u'<img alt="Previous" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/pg-prev-disbl-btn.png">',
 u'<img alt="Next" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/pg-nxt-btn.png">',
 u'<img src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/btn-bookslot-bskt-d.gif" class="delSlotBtn" alt="Book slot disabled">',
 u'<img src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/btn-checkout-bskt-d.gif" class="chkOutBtn" alt="Checkout disabled">',
 u'<img alt="" class="legendImg" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/star.png" title="">',
 u'<img alt="" class="legendImg" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/star.png" title="">',
 u'<img alt="Opens in a new window" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/open-window.png" title="Opens in a new window">',
 u'<img src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/btn-fulltrolley-bskt-d.gif" class="fullTrolleyBtn" alt="">',
 u'<img alt="Add to list" class="slAddToListDsbld" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/dsbld_sl_addtolst_icn.png">',
 u'<img alt="Tesco Strapline" src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/en-GB/i368/footer/strapline_footer_bottom_my.png" title="Tesco Strapline">']

I checked again using Chrome Inspector, and there are src links for each product. Why are there no src links in the returned results?

Please help.

Thanks.

Insane Skull
Tatt Ehian
  • If you are like me, you were doing web crawling, and to get XPath to work you ran regex replacements on a chunk of the HTML so it could go through an XML parser, e.g. adding a / at the end of each img element to keep tags paired. In doing so I also did a find-and-replace on the src attribute to reduce clutter, and much later I needed it. It was a facepalm moment. I wouldn't be surprised if you did something similar. (I use Notepad++ with the XML plugin's pretty print to ensure all elements are matched.) – JGFMK May 06 '22 at 22:06

2 Answers

0

This is because of JavaScript rendering: the raw HTML of the page you are requesting doesn't contain that information; it is filled in by JavaScript while the page loads.

You can verify this by installing a Toggle JavaScript extension in your browser, which lets you see what is actually downloaded without JavaScript.
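The same check can be done on the downloaded HTML itself, without a browser extension. The following is a minimal sketch using only the Python standard library; the names `ImgCollector`, `split_by_src`, and `RAW_HTML` are made up for illustration, and the sample markup is just two tags copied from the output in the question, not a live download.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def split_by_src(html):
    """Return (ids_without_src, src_values) for the <img> tags in html."""
    parser = ImgCollector()
    parser.feed(html)
    missing = [img.get("id") for img in parser.images if "src" not in img]
    present = [img["src"] for img in parser.images if "src" in img]
    return missing, present


# Hypothetical sample: two tags copied from the output in the question.
RAW_HTML = (
    '<img alt="Search" class="searchFor" '
    'src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/searchFor.png">'
    '<img alt="" id="productImg-7072093609">'
)

missing, present = split_by_src(RAW_HTML)
print("img tags without src:", missing)
print("img tags with src:", present)
```

If the product images show up in the "without src" list for the raw response body, the attribute is being added later in the browser, which matches what the Scrapy shell output shows.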

eLRuLL
  • I see...thanks for the information. Is there another way to scrape those links? – Tatt Ehian Dec 11 '15 at 12:46
  • Use Firebug (on Firefox) or Chrome Developer Tools to check which request returns the information you need, or use `selenium` to load the page like a browser. – eLRuLL Dec 11 '15 at 12:50
0

It could be because the XPath '//img' matches more than one node.

Try the following XPath to select the specific node: .//img[contains(@src,'{{specific value of src}}')]
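In XPath, attributes are addressed with `@`, so a filter on the src attribute is written against `@src`. In the Scrapy shell that would look something like `response.xpath("//img[contains(@src, 'tLogoMain')]/@src").extract()`. The sketch below illustrates the same idea with the standard library only; since `xml.etree.ElementTree` supports just a subset of XPath (no `contains()`), it selects `img[@src]` and does the substring test in Python. `SAMPLE` and `srcs_containing` are hypothetical names, and the sample tags are taken from the question's output but self-closed so they parse as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two tags from the question, made well-formed XML.
SAMPLE = (
    "<div>"
    '<img alt="Grocery Home" class="tLogoMain" '
    'src="http://assets.ap-tescoassets.com/UIAssets/MY/grocery/default/i368/tLogoMain.gif"/>'
    '<img alt="" id="productImg-7072093609"/>'
    "</div>"
)


def srcs_containing(xml_text, needle):
    """Return the src of every <img> whose src attribute contains needle."""
    root = ET.fromstring(xml_text)
    # [@src] keeps only <img> elements that actually have a src attribute.
    return [
        img.get("src")
        for img in root.findall(".//img[@src]")
        if needle in img.get("src")
    ]


print(srcs_containing(SAMPLE, "tLogoMain"))
print(srcs_containing(SAMPLE, "productImg"))
```

Note that a filter on `@src` can only match tags that carry the attribute; the `productImg-…` tags in the question's output have none, so they are never selected by it.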

NDP