
This is my first time using Scrapy, and I'm trying to write the information I need to a CSV file using a pipeline. Everything seemed to work fine until I tried to scrape more than one page, at which point it started returning a blank CSV file. I think the problem is in the spider (since it stopped working after I made changes there), but I'm including the pipeline as well in case something is wrong there too.

Here's my spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from indeed.items import IndeedItem


class IndeedSpider(CrawlSpider):
    name = "indeed"
    allowed_domains = ['www.indeed.com']
    start_urls = [
    'http://www.indeed.com/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=19103&fromage=30&limit=10&sort=&psf=advsrch'
    ]
    rules = (Rule(LinkExtractor(allow=('http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=.*'))), )

    def parse_item(self, response):
        for sel in response.xpath("//div[contains(@class, 'row ')]"):
            items = []
            jobs = sel.xpath('//a[contains(@data-tn-element, "jobTitle")]/text()').extract()
            city = sel.xpath('//span[@class="location"]/text()').extract()
            company = sel.xpath('//span[@class="company"]/text()').extract()
            for j, c, co in zip(jobs, city, company):
                position = IndeedItem()
                position['jobs'] = j.strip()
                position['city'] = c.strip()
                position['company'] = co.strip()
                items.append(position)
            yield items

And here's my pipeline:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exporters import CsvItemExporter

class IndeedPipeline(object):
    def process_item(self, item, spider):
        return item


class CsvExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_jobs.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Any help would be much appreciated.

  • What are the stats when the crawl finishes? It should say how many items have been scraped, etc. Maybe all of your items are dropped? – Granitosaurus Jul 11 '16 at 05:50
  • Ok, so it does say `[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)`. So that seems to be part of the problem. – Matthew Barnette Jul 12 '16 at 23:31
  • There is an issue with your crawl, indeed. The usual way to debug this is to add `inspect_response(response, self)` somewhere in the parse function; during the crawl, Scrapy will drop into a shell that you can use to inspect the `response` object, i.e. check whether your XPath finds anything. – Granitosaurus Jul 13 '16 at 08:37

2 Answers


Your LinkExtractor is not extracting any links.

if you do: scrapy shell "http://www.indeed.com/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=19103&fromage=30&limit=10&sort=&psf=advsrch"

and recreate your link extractor in the scrapy shell:

from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(allow=('http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=.*'))
le.extract_links(response)

you'll notice that it doesn't extract anything. The reason is that `allow` takes a regular expression, not a literal URL: the unescaped `?` in your pattern acts as a regex quantifier (making the `s` in `jobs` optional) rather than matching a literal question mark, so the pattern never matches the pagination links.
You should check the official documentation on LinkExtractors, build a working pattern (escaping regex metacharacters like `?` and `.`), and test it in the shell before using it in your spider.
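One way to see why the pattern extracts nothing is to test it with plain `re` (the sample pagination URL below is an assumption based on the pattern in the question, but the regex behaviour is the point):

```python
import re

# The pattern from the spider, exactly as passed to LinkExtractor's `allow`.
broken = r'http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=.*'
# The same pattern with the regex metacharacters '?' and '.' escaped.
fixed = r'http://www\.indeed\.com/jobs\?q=&l=19103&radius=50&fromage=30&start=.*'

# A plausible pagination URL (assumed here for illustration).
url = 'http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=10'

print(re.search(broken, url))  # None: 'jobs?' means 'job' plus an optional 's'
print(re.search(fixed, url))   # a match object
```

Once the escaped pattern matches in the shell, the same string will work as the `allow` argument.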

Granitosaurus

The XPath expressions inside the loop start with `//`, which searches the entire document rather than the current `sel` row, so every iteration re-selects every match on the page. Make them relative to the row with a `.//` prefix:

    jobs = sel.xpath('.//a[contains(@data-tn-element, "jobTitle")]/text()').extract()
    city = sel.xpath('.//span[@class="location"]/text()').extract()
    company = sel.xpath('.//span[@class="company"]/text()').extract()
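As a sanity check on relative paths, here is a minimal standard-library sketch (the markup is a simplified, hypothetical stand-in for Indeed's real HTML): `.//a` searches every descendant of the current row, while a bare `a` only matches direct children.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one search-results page (not Indeed's real markup).
page = ET.fromstring("""
<div>
  <div class="row">
    <h2><a data-tn-element="jobTitle">Data Analyst</a></h2>
    <span class="company">Acme Corp</span>
  </div>
</div>
""")

row = page.find('div')  # one result "row", analogous to `sel` in the spider

# './/a' searches all descendants of the row, so it finds the nested link
nested = row.findall('.//a')
# a bare 'a' only looks at direct children, and the link is inside <h2>
direct = row.findall('a')

print(len(nested), len(direct))
```

The same distinction applies in Scrapy's XPath selectors, which is why the relative `.//` form is the safe choice inside a per-row loop.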
Tracy