How to get a clean result when scraping data from a website using Scrapy

Question

I am new to Python and I am trying to scrape data from the yellow pages. I was able to scrape it, but the result I get is messy.

This was the result I got:

2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines: 
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

How could I get a clean result? I just want to get the name, address, phone number and links only.

By the way, this is the code I'm using:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from eyp.items import EypItem
class EypSpider(BaseSpider):
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ol[@class="result"]/li')
        items = []
        for title in titles:
            item = EypItem()
            item['title'] = title.select(".//p/text()").extract()
            item['link'] = title.select(".//a/@href").extract()
            items.append(item)
        return items

It seems like in `item['title']` you are choosing EVERY `<p>` element inside the chosen `<li>`.
Read up on [Item Loaders](http://doc.scrapy.org/en/latest/topics/loaders.html). — Steven Almeroth, Mar 24 '13 at 18:19

score 2 · Answer 1 · answered Mar 24 '13 at 15:19

Your code is a bit messy but I will try to help:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class EypItem(Item):
    name = Field()
    address = Field()
    phone = Field()

class eypSpider(BaseSpider):
    name = "eyp.ph"
    allowed_domains = ["eyp.ph"]
    start_urls = ["http://www.eyp.ph/home-real-estate/search/real-estate/davao/cat/real-estate-brokers"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//li/div[@class='details']")
        items = []
        for site in sites:
            itemE = EypItem()
            itemE["name"] = site.select("normalize-space(p[1]/text())").extract()
            itemE["address"] = site.select("normalize-space(p[2]/text())").extract()
            itemE["phone"] = site.select("normalize-space(p[3]/text())").extract()
            items.append(itemE)
        return items

Your posted code does not include a definition for the class EypItem, so I have suggested one. With the above saved as test.py, run this from the command line:

$ scrapy runspider test.py -o items.json -t json

This writes the scraped items as JSON to a file named items.json. A sample of the output:

[{"phone": ["Phone: +63(907)6390603"], "name": ["(CARLOS A. VARGAS)"], "address": ["Mezzanine Wee Eng Apartment, Guerrero Street, Davao City, Davao Del Sur"]},
 {"phone": ["Phone: +63(921)9566577"], "name": ["(ROGELIO G. CARBIERO)"], "address": ["Sto. Nino Heights, Pantinople Village, Davao City, Davao Del Sur"]},
 {"phone": ["Phone: +63(917)3137855"], "name": ["(FLORIZEL C. CHAVEZ)"], "address": ["12 Tulip Street, El Rio Vista Village P4a, Davao City, Davao Del Sur"]},
..........
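
Each field in the sample above is still a one-element list, and the phone value keeps its "Phone:" label. If you want plain strings, one option is to post-process items.json. This is only a minimal sketch that assumes the file layout shown above; `first`, `clean_record` and `clean_file` are hypothetical helper names, not part of Scrapy:

```python
import json


def first(values):
    """Return the first non-empty string in a list, or '' if there is none."""
    for value in values:
        if value:
            return value
    return ""


def clean_record(record):
    """Flatten one scraped record into plain strings (hypothetical helper)."""
    name = first(record.get("name", [])).strip()
    # Names in the sample are wrapped in parentheses, e.g. "(CARLOS A. VARGAS)".
    if name.startswith("(") and name.endswith(")"):
        name = name[1:-1].strip()

    phone = first(record.get("phone", [])).strip()
    # Drop the leading "Phone:" label if it is present.
    if phone.startswith("Phone:"):
        phone = phone[len("Phone:"):].strip()

    address = first(record.get("address", [])).strip()
    return {"name": name, "address": address, "phone": phone}


def clean_file(path):
    """Load a JSON array of scraped records and clean each one."""
    with open(path) as handle:
        return [clean_record(record) for record in json.load(handle)]
```

Calling `clean_file("items.json")` would then return a list of dictionaries with `name`, `address` and `phone` as plain strings.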

She has the definition in another file: `from eyp.items import EypItem`. But good points there. What is the `normalize-space` deal there? — tonino.j, Mar 24 '13 at 15:23
The `normalize-space` was to remove whitespace using the XPath call. As noted elsewhere, `Item Loaders` may be a more appropriate way to do this. — user1609452, Mar 24 '13 at 18:40
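
For readers unfamiliar with the XPath `normalize-space()` function mentioned above: it trims leading and trailing whitespace and collapses each internal run of whitespace into a single space. A rough plain-Python equivalent is sketched below; the function name `normalize_space` is made up for illustration, and Python's `str.split()` treats a few more characters as whitespace than XPath does, so this is an approximation:

```python
def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    # str.split() with no arguments splits on runs of whitespace and ignores
    # leading/trailing whitespace; joining with one space collapses the runs.
    return " ".join(text.split())
```

For example, `normalize_space("  Davao \n\t City  ")` gives `"Davao City"`.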
