Scrapy recursively scraping craigslist

Question

I am using Scrapy to scrape craigslist: get all the links, follow each link, and store the description and the reply email for each page. So far I have written a Scrapy script which goes through craigslist/sof.com and gets all the job titles and URLs. I want to go into each URL and save the email and description for each job. Here's my code:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for entry in titles:
            # Each entry is one posting on the listing page
            title = entry.select("a/text()").extract()
            link = entry.select("a/@href").extract()
            desc = entry.select("a/replylink").extract()
            print link, title

Any ideas how to do this?

How to do what exactly? Is there something wrong with your code or do you need more code? — rickhg12hs, Nov 26 '13 at 02:10
I need more code. This code is fine. I want to recurse through each link and then scrape data from those linked pages. — Scooby, Nov 26 '13 at 03:16

score 1 · Answer 1 · answered Nov 26 '13 at 04:20

Scrapy callback functions should yield (or return) Items and Requests.

A returned Item is passed through the item pipeline according to your configuration. The spider's next step is determined by returning a Request whose callback field references the function that should handle the response.

From the Scrapy documentation:

from scrapy.http import Request

def parse_page1(self, response):
    return Request("http://www.example.com/some_page.html",
                   callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.log("Visited %s" % response.url)

score 1 · Answer 2 · answered Jul 18 '16 at 21:10

Scraping craigslist is prohibited by their terms of use:

Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited

source: https://www.craigslist.org/about/terms.of.use

Their API is another matter; however, it is only updated every hour (so there is a lag time of 1 hour).
