
Hi, I'm using Scrapy to scrape paginasamarillas.es, but I'm not getting any results. This is my code. Can you please help me with this?

from scrapy.item import Item, Field

class AyellItem(Item):
name = Field()
pass

This is the spider

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from ayell.items import AyellItem

class YellSpider(CrawlSpider):
    name = 'yell'
    allowed_domains = ['http://www.paginasamarillas.es']
    start_urls = ['http://www.paginasamarillas.es/alimentacion/all-ma/all-pr/all-is/all-ci/all-ba/all-pu/all-nc/1']



    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        directors = hxs.select("/html/body")
        items = []
        for directors in directors:
            item = AyellItem()
            item ["name"] = directors.select("/h1").extract()   
            items.append(item)
            return items

and this is what I get:

2015-07-31 19:11:25-0300 [yell] DEBUG: Crawled (200) <http://www.paginasamarillas.es/alimentacion/all-ma/all-pr/all-is/all-ci/all-ba/all-pu/all-nc/1> (referer: None)
2015-07-31 19:11:25-0300 [yell] INFO: Closing spider (finished)
2015-07-31 19:11:25-0300 [yell] INFO: Dumping spider stats:
    {'downloader/request_bytes': 267,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 30509,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 7, 31, 22, 11, 25, 731485),
     'scheduler/memory_enqueued': 1,

1 Answer


First off, it looks like this is a new spider. If you're able to, I'd recommend updating to Scrapy 1.0.1 instead of staying with 0.24 (or lower).
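One thing to be aware of when upgrading: Scrapy 1.0 moved several modules out of scrapy.contrib. The old import paths you're using still work but are deprecated; the 1.0-style equivalents look like this:

from scrapy.spiders import CrawlSpider, Rule      # was scrapy.contrib.spiders
from scrapy.linkextractors import LinkExtractor   # replaces scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor
from scrapy.selector import Selector              # replaces the deprecated HtmlXPathSelector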

AyellItem has an indentation error, though this may just be how you typed it into SO. Additionally, the pass statement serves no purpose once the class has a body.
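Cleaned up, the item definition would look something like this:

from scrapy.item import Item, Field

class AyellItem(Item):
    # the class body must be indented; pass is unnecessary once
    # at least one field is defined
    name = Field()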

As for the spider itself, there are a few notable issues:

  • You're not specifying any rules. The spider will not process any links after retrieving the first page.
  • You're not parsing the content of the first page. In order to do so, you need to override the parse_start_url(response) method.
  • Your XPath selectors don't match the provided page. There is only one <h1> element on the page, and it is not at /html/body/h1. The items you want are list items (<li> elements) nested within a div with the class "contenido". A sketch addressing all three points follows below.
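Here is a minimal sketch putting those three points together. It assumes Scrapy 1.0-style imports; the restrict_xpaths expression for the pagination links and the XPath used inside each <li> are guesses you will need to adjust against the real markup.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ayell.items import AyellItem


class YellSpider(CrawlSpider):
    name = 'yell'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['paginasamarillas.es']
    start_urls = ['http://www.paginasamarillas.es/alimentacion/all-ma/all-pr/all-is/all-ci/all-ba/all-pu/all-nc/1']

    # Follow pagination links and run parse_items on every page fetched.
    # The restrict_xpaths value is an assumption about the pagination
    # markup; adjust it to whatever the site actually uses.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="paginacion"]'),
             callback='parse_items', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not pass start_urls responses to a callback
        # by default, so hand the first page to the same parsing logic.
        return self.parse_items(response)

    def parse_items(self, response):
        # The listings are <li> elements inside the div with class
        # "contenido"; what you pull out of each one depends on the
        # markup, so the inner expression here is only a placeholder.
        for listing in response.xpath('//div[@class="contenido"]//li'):
            item = AyellItem()
            item['name'] = listing.xpath('normalize-space(.)').extract()
            yield item

Note that the callback is named parse_items, not parse: a CrawlSpider uses parse internally, so overriding it would break the rule processing.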

Reading up on CrawlSpider and Scrapy selectors, and generally familiarizing yourself with the technologies you're using, should help you out. Best of luck!

Rejected