
I am new to Python and I am trying to scrape data from the Yellow Pages. I was able to scrape it, but the result I get is messy.

This is the result I got:

2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines: 
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

How can I get a clean result? I just want the name, address, phone number, and links.

By the way, this is the code I'm using:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from eyp.items import EypItem


class EypSpider(BaseSpider):
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One <li> per listing in the results list.
        titles = hxs.select('//ol[@class="result"]/li')
        items = []
        for title in titles:
            item = EypItem()
            # Text of every <p> inside this listing.
            item['title'] = title.select(".//p/text()").extract()
            # href of every <a> inside this listing.
            item['link'] = title.select(".//a/@href").extract()
            items.append(item)
        return items
cnu
  • 1
    It seems that in `item['title']` you are selecting EVERY `<p>` element inside each chosen `<li>`. Shouldn't you select the content you want more precisely? Should your items really have only `title` and `link` if you want to scrape `name`, `phone number`, `address`, and `link`? Shouldn't you also select more precisely WHICH link you want scraped, rather than EVERY link, as you also did with `<p>`? You should study the basic manual before asking for help, don't you think? – tonino.j Mar 24 '13 at 15:08
  • I gave you 3 problems here that I see. – tonino.j Mar 24 '13 at 15:12
  • 1
    Read up on [Item Loaders](http://doc.scrapy.org/en/latest/topics/loaders.html). – Steven Almeroth Mar 24 '13 at 18:19
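
The three points raised in the comments (select one element per field, give the item one field per piece of data, and pick a specific link) can be sketched roughly as follows. This is only an illustration, not the asker's spider: the real Yellow Pages markup is not shown in the question, so the sample HTML and every class name in it (`name`, `address`, `phone`, `website`, `map`) are assumptions, and the standard library's `xml.etree.ElementTree` stands in for Scrapy's selectors so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the class names below are assumptions made for
# this sketch, not the real structure of the Yellow Pages result page.
SAMPLE = """
<ol class="result">
  <li>
    <p class="name">Acme Bakery</p>
    <p class="address">12 Main St</p>
    <p class="phone">555-0100</p>
    <a class="website" href="http://example.com/acme">Website</a>
    <a class="map" href="http://example.com/map/1">Map</a>
  </li>
  <li>
    <p class="name">Blue Cafe</p>
    <p class="address">9 Side Rd</p>
    <a class="website" href="http://example.com/blue">Website</a>
  </li>
</ol>
"""


def first_text(node, path):
    """Return the stripped text of the first match, or None if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def parse_listings(markup):
    """Build one dict per <li>, with one narrowly selected value per field."""
    root = ET.fromstring(markup.strip())
    listings = []
    for li in root.findall("li"):
        # Pick the one link we care about instead of every <a> in the listing.
        website = li.find(".//a[@class='website']")
        listings.append({
            "name": first_text(li, ".//p[@class='name']"),
            "address": first_text(li, ".//p[@class='address']"),
            "phone": first_text(li, ".//p[@class='phone']"),
            "link": website.get("href") if website is not None else None,
        })
    return listings


if __name__ == "__main__":
    for listing in parse_listings(SAMPLE):
        print(listing)
```

In the spider itself, the same idea would presumably mean giving `EypItem` separate `name`, `address`, `phone` and `link` fields and passing one relative XPath per field to `title.select(...)`, with the predicates adjusted to whatever attributes the real page uses.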