web-crawling - get item-title from bandcamp.com

Question

I am trying to get the item titles of the new releases listed in the 'Discover' section of bandcamp.com (rock -> all rock -> new arrivals):

scrapy shell 'https://bandcamp.com/?g=rock&s=new&p=0&gn=0&f=all&w=0'

Part of the relevant source code of the page looks like this:

<div class="col col-3-12 discover-item">
            <a data-bind="click: playMe, css: { 'playing': playing }" class="item-link playable">
                <span class="item-img ratio-1-1">
                    <img class="art" data-bind="src_art: { 'art_id': artId, 'format': 'art_tags_large' }" src="https://f4.bcbits.com/img/a1631562669_9.jpg">
                    <span class="plb-btn">
                        <span class="plb-bg"></span>
                        <span class="plb-ic"></span>
                    </span>
                </span>
                </a><a data-bind="attr: { 'href': itemURL }, text: title, click: playMe" class="item-title" href="https://reddieseloff.bandcamp.com/album/dead-rebel?from=discover-new">Dead Rebel</a>
                <a data-bind="attr: { 'href': bandURL }, text: artist, click: playMe" class="item-artist" href="https://reddieseloff.bandcamp.com?from=discover-new">Red Diesel</a>
                <span class="item-genre" data-bind="text: genre">rock</span>

        </div>

I tried to get the text of item-title (in this example 'Dead Rebel') with the help of xpath:

 response.xpath('//div[@class="col col-3-12 discover-item"]//a[@class="item-title"]/text()').extract()

but it returns nothing.

[]

It doesn't work for 'item-artist' either, so I wonder what I'm doing wrong.

I appreciate any help.

I'm not used to scrapy, but can you try `//a[@class="item-title"]`? Also, using `bs4` and the provided `html` I could get the `Dead Rebel` text that you want. Are you interested? Maybe you can mix some `bs4` and `scrapy` code... — dot.Py, Mar 28 '17 at 18:02
@dot.Py `bs4` does exactly the same thing scrapy's `parsel` does, so it wouldn't change much. — Granitosaurus, Mar 28 '17 at 18:07

score 2 · Accepted Answer · answered Mar 28 '17 at 18:02

All of the data you seek is stored in a hidden div node inside the page body.
When your browser loads the page, JavaScript unpacks this data and renders it into the markup you see. Scrapy does not run any JavaScript, so the elements you are selecting are not in the response it receives, and you need to do the unpacking step yourself:

 import json

 # all of the data is in the data-blob attribute of <div id="pagedata">
 data = response.css('div#pagedata::attr(data-blob)').extract()
 # the attribute value is a JSON string, so decode it into a Python dictionary
 data = json.loads(data[0])
 # dig through this dictionary to find your data
 # (it has pretty much everything, even more than the page displays)
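
If you want to try the same idea without a live request, here is a minimal offline sketch that uses only the standard library. The sample HTML and the `items`, `title` and `artist` keys are made up for illustration; the real blob on bandcamp.com will be structured differently, so inspect it before relying on any key names. `PageDataParser` and `extract_page_data` are hypothetical helper names, not part of scrapy.

```python
import json
from html.parser import HTMLParser


class PageDataParser(HTMLParser):
    """Remembers the data-blob attribute of <div id="pagedata">."""

    def __init__(self):
        super().__init__()
        self.blob = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "pagedata":
            # HTMLParser has already unescaped &quot; and friends here
            self.blob = attrs.get("data-blob")


def extract_page_data(html):
    """Return the decoded JSON blob, or None if the div is missing."""
    parser = PageDataParser()
    parser.feed(html)
    if parser.blob is None:
        return None
    return json.loads(parser.blob)


# Hypothetical sample: the key names below are assumptions, not bandcamp's schema
SAMPLE_HTML = (
    '<html><body>'
    '<div id="pagedata" data-blob="{&quot;items&quot;: [{&quot;title&quot;: '
    '&quot;Dead Rebel&quot;, &quot;artist&quot;: &quot;Red Diesel&quot;}]}"></div>'
    '</body></html>'
)

data = extract_page_data(SAMPLE_HTML)
titles = [item["title"] for item in data["items"]]
print(titles)
```

The point is only that the titles come out of the decoded dictionary, not out of `a.item-title` nodes, which do not exist until JavaScript has run.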

Granitosaurus

This really has pretty much everything. I'm trying to figure out what's in the dict with pprint: `from pprint import pprint` `out = open('dict.txt', 'w+')` `pprint(data, out)` – fuser60596 Mar 28 '17 at 20:06
You can write the data to a file with `json.dumps(data, indent=2)` to "pretty print" it, and then inspect it with a text editor or with software dedicated to viewing JSON trees, for example this online one: http://jsonviewer.stack.hu/ – Granitosaurus Mar 28 '17 at 20:39
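
A rough sketch of that dump step, assuming `data` is the dictionary decoded from the blob. The `sample` dictionary, the `dump_pretty` helper and the output file name are placeholders for illustration only.

```python
import json
import os
import tempfile


def dump_pretty(data, path):
    """Write data to path as indented, human-readable JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(json.dumps(data, indent=2, ensure_ascii=False))


# Placeholder standing in for the dictionary decoded from the page
sample = {"items": [{"title": "Dead Rebel", "artist": "Red Diesel"}]}

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "pagedata.json")
dump_pretty(sample, out_path)
```

Unlike `pprint`, the resulting file is valid JSON, so it can be pasted straight into a JSON viewer.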
