
scrapy shell 'https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4'

I want to get the album name "no tears left to cry - Single" from this page:

iTunes chart music preview page, "no tears left to cry - Single / Ariana Grande"

The XPath of the album name is: //*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1

So I tried

response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1')

but the result was []

How can I get the album information from this website?

Druta Ruslan
Roy Kim

2 Answers


This happens because Scrapy doesn't wait for JavaScript to load, so the element is not in the response yet. You need to use scrapy-splash; here is my answer on how to set up your Scrapy project with scrapy-splash.
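For reference, a minimal settings.py sketch of that setup, based on the scrapy-splash README. It assumes a Splash instance is already listening on localhost:8050 (for example, started with Docker); adjust SPLASH_URL if yours runs elsewhere.

```python
# settings.py (sketch): wiring scrapy-splash into a Scrapy project.
# Assumption: Splash is reachable at http://localhost:8050.
SPLASH_URL = "http://localhost:8050"

# Middlewares that send SplashRequest objects through the Splash HTTP API.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments with every queued request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these settings in place, a SplashRequest yielded from a spider is rendered by Splash before the callback sees it.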

If I use scrapy-splash, I get the result:

2018-06-30 20:50:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27 via http://localhost:8050/render.html> (referer: None)
2018-06-30 20:50:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27>
{'title': 'no tears left to cry - Single'}

Here is my simple spider:

import scrapy
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = "test"

    start_urls = ['https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27']

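    def start_requests(self):
        # Send every start URL through Splash so the JavaScript is rendered first.
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                )

    def parse(self, response):
        # The response is the rendered HTML, so the XPath copied from the browser now matches.
        yield {'title': response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1//text()').extract_first()}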

You can also do this with scrapy shell:

scrapy shell 'http://localhost:8050/render.html?url=https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4'

In [2]: response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1//text()').extract_first()
Out[2]: 'no tears left to cry - Single'
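A side note on the shell command above: the target URL has its own query string, so passing it unencoded in the url parameter can be ambiguous, because the & separators may be read as arguments to Splash rather than as part of the iTunes URL. Below is a small sketch of building the render.html URL with the target percent-encoded. The helper name splash_render_url is hypothetical, and the wait value is only an illustrative choice.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: Splash is listening locally on port 8050.
SPLASH_RENDER = "http://localhost:8050/render.html"
TARGET = (
    "https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537"
    "?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4"
)


def splash_render_url(target, wait=0.5):
    """Hypothetical helper: build a render.html URL with `target` safely encoded."""
    # urlencode percent-encodes the whole target, including its own ? and & characters.
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH_RENDER}?{query}"


render_url = splash_render_url(TARGET)

# Decoding the query again should give back exactly the original target URL.
round_trip = parse_qs(urlsplit(render_url).query)["url"][0]
print(render_url)
```

The printed URL can then be passed to scrapy shell in single quotes, the same way as the command above.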
Druta Ruslan

You'd better avoid JS rendering, which is slow, heavy and buggy. Spend 5 minutes in Chrome's "Network" tab to find the source of the data. It is usually either embedded in the page source or delivered via XHR requests.

In this case, all the data you want can be found in the page itself, but you should check its source code, not the rendered version. Use Ctrl+U in Chrome and then Ctrl+F to find all the parts you need.

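import json

def parse(self, response):
    # The album metadata is embedded as JSON in a <script name="schema:music-album"> tag.
    track_data = response.xpath('//script[@name="schema:music-album"]/text()').extract_first()
    track_json = json.loads(track_data)
    track_title = track_json['name']
    yield {'title': track_title}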

This does the trick in this case and runs about 5-7 times faster than Splash.
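To try the idea without running a crawl, here is a rough standard-library sketch of the same extraction. The SAMPLE_HTML string is a made-up stand-in for the real page source, and ScriptJSONExtractor and extract_album_title are hypothetical names. It also returns None instead of raising when the script tag is missing or does not contain valid JSON.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text of the first <script> whose name attribute matches."""

    def __init__(self, script_name):
        super().__init__()
        self.script_name = script_name
        self._inside = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self._done and dict(attrs).get("name") == self.script_name:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_album_title(html):
    """Return the 'name' field of the embedded album JSON, or None if unavailable."""
    parser = ScriptJSONExtractor("schema:music-album")
    parser.feed(html)
    raw = "".join(parser.chunks).strip()
    if not raw:
        return None  # no matching script tag in the source
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # tag found, but its content is not valid JSON
    return data.get("name") if isinstance(data, dict) else None


# Made-up page source for illustration only; the real page is much larger.
SAMPLE_HTML = """
<html><head>
<script name="schema:music-album" type="application/ld+json">
{"@type": "MusicAlbum", "name": "no tears left to cry - Single"}
</script>
</head><body><div id="ember653"></div></body></html>
"""

title = extract_album_title(SAMPLE_HTML)
missing = extract_album_title("<html><body>no data</body></html>")
print(title, missing)
```

Inside a spider, the XPath shown above does the same job in one line. The None check is the part worth carrying over, since json.loads fails on None if the tag ever disappears from the page.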

Michael Savchenko
  • Can I ask one more thing? It works well for iTunes, but when I tried to parse https://www.fifa.com/worldcup/players/browser/ to scrape the players' names, there were no script tags containing 'name'. How can I solve this problem? – Roy Kim Jul 03 '18 at 04:55
  • I would suggest building a tiny JavaScript-based website to better understand the principles of data delivery: AJAX, server-side rendering, etc., just as a form of self-education. It will be really interesting and useful (: Believe me. | In this case, as I said previously, the algorithm is: go to Chrome -> open the page source -> find nothing -> go to the inspect tools -> Network -> XHR -> refresh the page -> get three XHRs, which are: https://www.fifa.com/worldcup/players/_libraries/byposition/all/_players-list https://www.fifa.com/worldcup/players/_libraries/43922/_players-list – Michael Savchenko Jul 03 '18 at 12:02
  • Wow, amazing. I finally understand the web structure. I followed your guide slowly and in the end I got what I really wanted! I learned about XHR and the Network tab in the inspect tools from your kind answer. Thank you so much, Michael! – Roy Kim Jul 04 '18 at 15:45
  • Welcome! Good luck! (: – Michael Savchenko Jul 04 '18 at 16:03