I'm trying to scrape lyrics from The Original Hip Hop Lyrics Archive.
I've written a spider that scrapes an artist's lyrics when I start it on an artist page such as http://www.ohhla.com/anonymous/aesoprck/, but when I start it on http://www.ohhla.com/all.html, which links to the individual artist pages, I get nothing.
This is the rule that I'm trying to use to follow the links to artist pages:
    Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow=True)
and this is the rule I'm trying to use to follow the links to different pages with links to the artist pages:
    Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow=True)
I built this by modifying the Scrapy tutorial project, because for some reason it didn't work when I started a new project.
Here is my complete working example of the spider:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor


    class ohhlaSpider(CrawlSpider):
        name = "ohhla"
        download_delay = 0.5
        allowed_domains = ["ohhla.com"]
        start_urls = ["http://www.ohhla.com/anonymous/aesoprck/"]
        rules = (
            # trying to follow links to pages with more links to artist pages
            Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow=True),
            # trying to follow links to artist pages
            Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow=True),
            # succeeding in following links to album pages
            Rule(LinkExtractor(deny_extensions=("txt",), restrict_xpaths=('//ul/li',)), follow=True),
            # succeeding in extracting lyrics from the songs on album pages
            Rule(LinkExtractor(restrict_xpaths=('//ul/li',)), callback="extract_text", follow=False),
        )

        def extract_text(self, response):
            """Extract the lyrics text from a song page."""
            string = response.xpath('//pre/text()').extract()[0]
            with open("lyrics.txt", 'wb') as f:
                # the file is opened in binary mode, so encode the text explicitly
                f.write(string.encode("utf-8"))
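One thing to note about `extract_text`: opening `lyrics.txt` with mode `'wb'` truncates the file on every call, so only the last song scraped would remain in it. Below is a minimal sketch of an appending variant, assuming the goal is to collect all lyrics in one file. `append_lyrics` and the sample strings are hypothetical names introduced here for illustration, not part of the spider.

```python
import io
import os
import tempfile


def append_lyrics(path, text):
    """Append one song's text to the file at path, encoded as UTF-8."""
    # Mode "a" keeps existing content, unlike "w"/"wb", which truncate.
    with io.open(path, "a", encoding="utf-8") as f:
        f.write(text)
        f.write(u"\n")


# Demonstration in a temporary directory with made-up lyrics.
demo_path = os.path.join(tempfile.mkdtemp(), "lyrics.txt")
append_lyrics(demo_path, u"first song")
append_lyrics(demo_path, u"second song")

with io.open(demo_path, encoding="utf-8") as f:
    collected = f.read()
print(collected)
```

Inside the spider, the callback could call such a helper with the extracted string instead of opening the file itself.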