
I'm trying to scrape lyrics from The Original Hip Hop Lyrics Archive.

I've managed to write a spider that scrapes an artist's lyrics when I start it on an artist page such as this one: http://www.ohhla.com/anonymous/aesoprck/.

But when I start it on this page, which links to the different artist pages, http://www.ohhla.com/all.html, I get nothing.

This is the rule that I'm trying to use to follow the links to artist pages:

Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow= True)

and this is the rule I'm trying to use to follow the links to different pages with links to the artist pages:

Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow= True)

I built this by modifying the Scrapy tutorial project, since for some reason it didn't work when I started a new project.

Here is my complete working example of the spider:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ohhlaSpider(CrawlSpider):
    name = "ohhla"
    download_delay = 0.5
    allowed_domains = ["ohhla.com"]
    start_urls = ["http://www.ohhla.com/anonymous/aesoprck/"]
    rules = (Rule (LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow= True), # trying to follow links to pages with more links to artist pages
             Rule (LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow= True), # trying to follow links to artist pages
             Rule (LinkExtractor(deny_extensions=("txt"),restrict_xpaths=('//ul/li',)), follow= True), # succeeding in following links to album pages
             Rule (LinkExtractor(restrict_xpaths=('//ul/li',)), callback="extract_text", follow= False),) # succeeding in extracting lyrics from the songs on album pages

    def extract_text(self, response):
        """ extract text from webpage"""
        string = response.xpath('//pre/text()').extract()[0]
        with open("lyrics.txt", 'wb') as f:
            f.write(string)
Artturi Björk

2 Answers


restrict_xpaths should not point to the @href attribute. It should point to the elements inside which the link extractor should search for links:

Rule(LinkExtractor(restrict_xpaths='//h3'), follow=True)

Note that you can specify it as a string instead of a tuple.


You can also allow every link whose URL matches all*.html:

Rule(LinkExtractor(allow=r'all.*?\.html'), follow=True)
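
To see which URLs that pattern would let through, it can be tried with the standard re module. This is only a sketch: it assumes the link extractor applies allow patterns as an unanchored regex search against the absolute URL, and the all_two.html and small.html URLs are hypothetical examples, not pages confirmed to exist.

```python
import re

# Same pattern as the allow= argument above.
pattern = re.compile(r'all.*?\.html')

urls = [
    "http://www.ohhla.com/all.html",             # index page from the question
    "http://www.ohhla.com/all_two.html",         # hypothetical follow-on index page
    "http://www.ohhla.com/anonymous/aesoprck/",  # artist page from the question
    "http://www.ohhla.com/small.html",           # hypothetical; contains "all" by accident
]

# search() matches anywhere in the string, so the pattern is not anchored.
matched = [url for url in urls if pattern.search(url)]
print(matched)
```

Because the search is unanchored, a URL such as small.html also matches. If that matters, a stricter pattern along the lines of r'/all[^/]*\.html$' may be worth considering.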

You should also make sure your spider actually visits that "Parent Directory" page. Starting the crawl from it makes sense, since it is the index page for the catalog:

start_urls = ["http://www.ohhla.com/all.html"]
alecxe

Part two of this answer can be useful for crawling specific links in a webpage: https://stackoverflow.com/a/40146522/4418897

Santosh Pillai