I am new in python and I am trying to scrape a data from yellow pages. I was able to scrape it but I get a messed result.
This was the result i got:
2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
How could I get a clean result? I just want to get the name, address, phone number and links only.
By the way, the code I'm using to do this, was;
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from eyp.items import EypItem
class EypSpider(BaseSpider):
def parse(self, response):
hxs = HtmlXPathSelector(response)
titles = hxs.select('//ol[@class="result"]/li')
items = []
for title in titles:
item = EypItem()
item['title'] = title.select(".//p/text()").extract()
item['link'] = title.select(".//a/@href").extract()
items.append(item)
return items
` element inside chosen `
`? You should study basic manual before you ask for help, dont you think?