11

I am new to Scrapy and I would like to extract the content of each advertisement from this website. So I tried the following:

from scrapy.spiders import Spider
from craigslist_sample.items import CraigslistSampleItem

from scrapy.selector import Selector
class MySpider(Spider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        links = response.selector.xpath(".//*[@id='sortable-results']//ul//li//p")
        for link in links:
            content = link.xpath(".//*[@id='titletextonly']").extract()
            title = link.xpath("a/@href").extract()
            print(title,content)

items:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    link = Field()

However, when I ran the crawler I got nothing:

$ scrapy crawl --nolog craig
[]
[]
[]
[]
[]
... (repeated for every result)

Thus, my question is: how can I walk over each URL, follow each link, and crawl the content and the title? And what is the best way to do this?

student
  • How did you come up with the XPaths? `.//*[@id='sortable-results']//ul//li//p` looks _ok_, it should give you the `<p>` elements on the page. But within those `<p>`, I cannot see anything matching `.//*[@id='titletextonly']`. You can test your XPaths with `scrapy shell` (see the sketch after these comments). – paul trmbrth Nov 08 '16 at 09:15
  • Examples of scrapy usage or of XPath? I believe https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-quotes-and-authors is quite similar to your use-case. – paul trmbrth Nov 08 '16 at 17:57
  • Every website is different, and the data you're after is your use-case. Perhaps you want a course on XPath and [this blog post](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/) can serve as a good intro. – paul trmbrth Nov 09 '16 at 08:59
  • I will accept the answer that has more upvotes; both were great. – student Nov 15 '16 at 17:47
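As suggested in the comments, `scrapy shell` is a quick way to check what the question's XPaths actually match. A minimal sketch, using the listing URL and row selector from the question; the inspection steps are illustrative and the page structure may differ:

$ scrapy shell "http://sfbay.craigslist.org/search/npo"
>>> rows = response.xpath(".//*[@id='sortable-results']//ul//li//p")
>>> len(rows)                     # how many result rows the selector matched
>>> rows[0].extract()             # inspect one row's HTML to see what it really contains
>>> rows[0].xpath(".//*[@id='titletextonly']").extract()   # empty: that id only exists on the detail page
>>> rows[0].xpath("a/@href").extract_first()               # the row's link, if the <a> is a direct child of the <p>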

2 Answers

14

To scaffold a basic Scrapy project, you can use the command:

scrapy startproject craig

Then add the spider and items:

craig/spiders/spider.py

from scrapy import Spider
from craig.items import CraigslistSampleItem
from scrapy.selector import Selector
from scrapy import Request
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

class CraigSpider(Spider):
    name = "craig"
    start_url = "https://sfbay.craigslist.org/search/npo"

    def start_requests(self):

        yield Request(self.start_url, callback=self.parse_results_page)


    def parse_results_page(self, response):

        sel = Selector(response)

        # Browse paging.
        page_urls = sel.xpath(""".//span[@class='buttons']/a[@class='button next']/@href""").getall()

        for page_url in page_urls + [response.url]:
            page_url = urljoin(self.start_url, page_url)

            # Yield a request for the next page of the list, with callback to this same function: self.parse_results_page().
            yield Request(page_url, callback=self.parse_results_page)

        # Browse items.
        item_urls = sel.xpath(""".//*[@id='sortable-results']//li//a/@href""").getall()

        for item_url in item_urls:
            item_url = urljoin(self.start_url, item_url)

            # Yield a request for each item page, with callback self.parse_item().
            yield Request(item_url, callback=self.parse_item)


    def parse_item(self, response):

        sel = Selector(response)

        item = CraigslistSampleItem()

        item['title'] = sel.xpath('//*[@id="titletextonly"]').extract_first()
        item['body'] = sel.xpath('//*[@id="postingbody"]').extract_first()
        item['link'] = response.url

        yield item

craig/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    body = Field()
    link = Field()

craig/settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'craig'

SPIDER_MODULES = ['craig.spiders']
NEWSPIDER_MODULE = 'craig.spiders'

ITEM_PIPELINES = {
   'craig.pipelines.CraigPipeline': 300,
}

craig/pipelines.py

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exporters import CsvItemExporter

class CraigPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_ads.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

You can run the spider with the following command from the root of your project:

scrapy runspider craig/spiders/spider.py

It should create a `craig_ads.csv` file in the root of your project.
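As an aside (a sketch, not part of the original answer): if a flat file is all you need, Scrapy's built-in feed exports can also write the yielded items directly, without a pipeline:

scrapy crawl craig -o items.csv

As discussed in the comments below, this only produces output once `parse_item()` actually yields the items.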

Ivan Chaer
  • Thanks for the help Ivan. Indeed, another example of how to do it via a pipeline would be helpful for the community. Could you provide us one? – student Nov 15 '16 at 14:18
  • Sure! What's the format you need the output to be? CSV? JSON? An Excel file? – Ivan Chaer Nov 15 '16 at 14:20
  • Let's say I need a CSV. – student Nov 15 '16 at 14:29
  • Also, I tried `scrapy crawl myspider -o items.csv` and the file is empty... any idea why this is happening? – student Nov 15 '16 at 14:43
  • It's because we weren't yielding the results at the end of `parse_item()` (we were just printing the items for the sake of the example). I added the `yield`, and also added the pipeline to export `CSV` files. Please let me know if this helps. – Ivan Chaer Nov 15 '16 at 14:46
  • I ran it like `scrapy runspider craig/spiders/test.py` and I did not get the file. Why is this? – student Nov 15 '16 at 15:00
  • Could be for a couple of reasons. Did you update `craig/settings.py`, as indicated? After running the spider, you normally get a small report. On this report, do you have a line that says `'item_scraped_count': 101,`? Did you look for the file on the same path where you have by default the file `scrapy.cfg`? – Ivan Chaer Nov 15 '16 at 15:33
  • Ohhh!... sure! let me check – student Nov 15 '16 at 15:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128191/discussion-between-ivan-chaer-and-student). – Ivan Chaer Nov 15 '16 at 17:19
5

I am trying to answer your question.

First of all, you got blank results because your XPath queries are incorrect. The XPath `.//*[@id='sortable-results']//ul//li//p` does locate the relevant `<p>` nodes, though I don't like the expression itself. However, your subsequent XPath expressions `.//*[@id='titletextonly']` and `a/@href` cannot locate the link and title from those nodes as you expected. Perhaps you mean to locate the title text and the title's hyperlink; if so, I believe you need to learn XPath, starting with the HTML DOM.

I do not want to teach you how to write XPath queries here, as there are plenty of resources online. Instead, I would like to mention some features of Scrapy's XPath selectors:

  1. Scrapy's XPath selector is an improved wrapper around standard XPath queries.

A standard XPath query returns an array of the DOM nodes you queried. You can open your browser's developer tools (F12) and use the console command $x(x_exp) to test an expression. I highly suggest testing your XPath expressions this way: it gives you instant results and saves a lot of time. If you have time, get familiar with your browser's web development tools; they will help you quickly understand the page structure and locate the elements you are looking for.

Scrapy's response.xpath(x_exp), on the other hand, returns a SelectorList: a list of Selector objects corresponding to the nodes matched by the XPath query. Both the Selector and SelectorList classes provide useful methods for operating on the results:

  • extract: returns a list of the matched document nodes serialized to unicode strings
  • extract_first: returns a scalar, the first of the extract results
  • re: returns a list, the regex matches over the extract results
  • re_first: returns a scalar, the first of the re results

These methods make your code much more convenient. One example is that you can call xpath directly on a SelectorList object. If you have tried lxml before, you will see how useful this is: to run an XPath query on the results of a previous query in lxml, you have to iterate over the earlier results. Another example: when you are sure there is at most one element in the list, you can use extract_first to get a scalar value instead of indexing into the list (e.g., rlist[0]), which would raise an index error when nothing matched. Remember that there are always exceptions when parsing web pages, so be careful and make your code robust.
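A small sketch of these helpers against the listing page from the question; the exact XPath and the regular expression are illustrative assumptions, not taken from the answer:

# run inside: scrapy shell "http://sfbay.craigslist.org/search/npo"
links = response.xpath("//*[@id='sortable-results']//li//a")   # a SelectorList
hrefs = links.xpath("./@href").extract()               # xpath() called directly on the SelectorList
first_href = links.xpath("./@href").extract_first()    # scalar, or None if nothing matched
post_ids = links.re(r'/(\d+)\.html')                   # regex applied over the matched nodes
first_id = links.re_first(r'/(\d+)\.html')             # first regex match only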

  2. Absolute XPath vs. relative XPath

Keep in mind that if you are nesting XPathSelectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the XPathSelector you’re calling it from.

When you call node.xpath(x_expr): if x_expr starts with /, it is an absolute query and XPath searches from the document root; if x_expr starts with ., it is a relative query. This is also noted in the standard's 2.5 Abbreviated Syntax:

. selects the context node

.//para selects the para element descendants of the context node

.. selects the parent of the context node

../@lang selects the lang attribute of the parent of the context node
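A short sketch of the difference inside a Scrapy callback (the selectors are illustrative):

rows = response.xpath("//*[@id='sortable-results']//li")
for row in rows:
    # starts with '/': absolute, searched from the document root,
    # so this matches every <a> on the whole page for every row
    page_links = row.xpath("//a/@href").extract()
    # starts with '.': relative, searched only inside this row
    row_links = row.xpath(".//a/@href").extract()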

  3. How to follow the next page, and when to stop following

For your application, you probably need to follow the next page. Here the next-page node is easy to locate: there are "next" buttons. However, you also need to decide when to stop following. Look carefully at the URL query parameters to work out the URL pattern of your application. Here, to determine when to stop following the next page, you can compare the current item range with the total number of items.

Edit

I was a little confused about what "content of the link" meant. Now I understand that @student wants to follow the link and extract the ad content as well. The following is a solution.

  4. Send a Request and attach its parser

As you may have noticed, I use Scrapy's Request class to follow the next page. Actually, the power of the Request class goes beyond that: you can attach the desired parse function to each request by setting the callback parameter.

callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.

In step 3, I did not set a callback when sending the next-page requests, as those requests should be handled by the default parse function. Now we come to the specific ad page, which is a different page from the ad list page. Thus we need to define a new parser function, say parse_ad, and attach it to each ad-page request via callback.

Let's look at the revised sample code, which works for me:

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()


class AdItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()

The spider

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapydemo.items import ScrapydemoItem
from scrapydemo.items import AdItem
try:
    from urllib.parse import urljoin
except ImportError:
    from urlparse import urljoin


class MySpider(Spider):
    name = "demo"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        # locate list of each item
        s_links = response.xpath("//*[@id='sortable-results']/ul/li")
        # locate next page and extract it
        next_page = response.xpath(
            '//a[@title="next page"]/@href').extract_first()
        next_page = urljoin(response.url, next_page)
        to = response.xpath(
            '//span[@class="rangeTo"]/text()').extract_first()
        total = response.xpath(
            '//span[@class="totalcount"]/text()').extract_first()
        # test end of following
        if int(to) < int(total):
            # important, send request of next page
            # default parsing function is 'parse'
            yield Request(next_page)

        for s_link in s_links:
            # locate and extract
            title = s_link.xpath("./p/a/text()").extract_first()
            link = s_link.xpath("./p/a/@href").extract_first()
            if title is None or link is None:
                self.logger.warning('no title or link found: %s', response.url)
            else:
                title = title.strip()
                link = urljoin(response.url, link)
                yield ScrapydemoItem(title=title, link=link)
                # important, send request of ad page
                # parsing function is 'parse_ad'
                yield Request(link, callback=self.parse_ad)

    def parse_ad(self, response):
        ad_title = response.xpath(
            '//span[@id="titletextonly"]/text()').extract_first()
        ad_description = ''.join(response.xpath(
            '//section[@id="postingbody"]//text()').extract())
        if ad_title is not None and ad_description:
            yield AdItem(title=ad_title.strip(), description=ad_description)
        else:
            self.logger.warning('no title or description found: %s', response.url)

Key Note

A snapshot of output:

2016-11-10 21:25:14 [scrapy] DEBUG: Scraped from <200 http://sfbay.craigslist.org/eby/npo/5869108363.html>
{'description': '\n'
                '        \n'
                '            QR Code Link to This Post\n'
                '            \n'
                '        \n'
                'Agency History:\n' ........
 'title': 'Staff Accountant'}
2016-11-10 21:25:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 39259,
 'downloader/request_count': 117,
 'downloader/request_method_count/GET': 117,
 'downloader/response_bytes': 711320,
 'downloader/response_count': 117,
 'downloader/response_status_count/200': 117,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2016, 11, 11, 2, 25, 14, 878628),
 'item_scraped_count': 314,
 'log_count/DEBUG': 432,
 'log_count/INFO': 8,
 'request_depth_max': 2,
 'response_received_count': 117,
 'scheduler/dequeued': 116,
 'scheduler/dequeued/memory': 116,
 'scheduler/enqueued': 203,
 'scheduler/enqueued/memory': 203,
 'start_time': datetime.datetime(2016, 11, 11, 2, 24, 59, 242456)}
2016-11-10 21:25:14 [scrapy] INFO: Spider closed (shutdown)

Thanks. I hope this is helpful, and have fun.

rojeeer
  • Thanks for the help. What I still do not understand is how to extract the content of each link. – student Nov 11 '16 at 01:25
  • @student What do you mean by extracting the content of a link? Do you want to go to that link and crawl some content? – rojeeer Nov 11 '16 at 01:29
  • The full content of the announcement (i.e. the text and the description of the advertisement). – student Nov 11 '16 at 01:51
  • The whole content of each link. – student Nov 11 '16 at 01:58
  • @student All right, now I understand: you want to get the advertisement page. It's not difficult; just send a request for the URL you already extracted and add a new parser for these requests. I will modify my code, wait and see. – rojeeer Nov 11 '16 at 02:09
  • @student, I updated the answer. Check whether it works for you. – rojeeer Nov 11 '16 at 02:47
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127904/discussion-between-rojeeer-and-student). – rojeeer Nov 11 '16 at 14:20
  • Just to accept the answer, could you provide the items.py? I get an unresolved reference error, and this will benefit the community. – student Nov 16 '16 at 14:45
  • Of course. items.py is quite simple; I added it for you. – rojeeer Nov 16 '16 at 22:22
  • Thanks. When I do `scrapy crawl demo -o file.csv` I do not get the advertisement content in the CSV file... why? – student Nov 17 '16 at 03:10
  • You need your own handler function in pipelines.py to store the items. – rojeeer Nov 18 '16 at 00:29