
This is my first time using Scrapy, and I'm trying to write the information I need to a CSV file using a pipeline. Everything seemed to work fine until I tried to scrape more than one page, at which point it started returning a blank CSV file. I think the problem is in the spider (since it stopped working after I made changes there), but I'm including the pipeline as well in case something is wrong there too.

Here's my spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from indeed.items import IndeedItem


class IndeedSpider(CrawlSpider):
    name = "indeed"
    allowed_domains = ['www.indeed.com']
    start_urls = [
    'http://www.indeed.com/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=19103&fromage=30&limit=10&sort=&psf=advsrch'
    ]
    rules = (Rule(LinkExtractor(allow=('http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=.*'))), )

    def parse_item(self, response):
        for sel in response.xpath("//div[contains(@class, 'row ')]"):
            items = []
            jobs = sel.xpath('//a[contains(@data-tn-element, "jobTitle")]/text()').extract()
            city = sel.xpath('//span[@class="location"]/text()').extract()
            company = sel.xpath('//span[@class="company"]/text()').extract()
            for j, c, co in zip(jobs, city, company):
                position = IndeedItem()
                position['jobs'] = j.strip()
                position['city'] = c.strip()
                position['company'] = co.strip()
                items.append(position)
            yield items

And here's my pipeline:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exporters import CsvItemExporter

class IndeedPipeline(object):
    def process_item(self, item, spider):
        return item


class CsvExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_jobs.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Any help would be much appreciated.

  • What are the stats when the crawl finishes? It should say how many items have been scraped, etc. Maybe all of your items are dropped? – Granitosaurus Jul 11 '16 at 05:50
  • Ok, so it does say `[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)`. So that seems to be part of the problem. – Matthew Barnette Jul 12 '16 at 23:31
  • There is an issue with your crawl, indeed. The usual way to debug this is to add `inspect_response(response, self)` somewhere in the parse function; during the crawl, Scrapy will drop into a shell that you can use to inspect the `response` object, i.e. check whether your XPath finds anything. – Granitosaurus Jul 13 '16 at 08:37

2 Answers


Your LinkExtractor is not extracting any links.

if you do: scrapy shell "http://www.indeed.com/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=19103&fromage=30&limit=10&sort=&psf=advsrch"

and recreate your link extractor in the scrapy shell:

from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(allow=('http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=.*'))
le.extract_links(response)

you'll notice that it doesn't extract anything. The reason is that `allow` takes a regular expression, not a literal URL: the unescaped `?` in your pattern acts as a regex quantifier (making the `s` in `jobs` optional) rather than matching a literal question mark, so the pattern never matches the pagination links.
You should check the official documentation on LinkExtractors, build a working pattern (escaping regex metacharacters like `?` and `.`), and test it in the shell before using it in your spider.
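One way to see why the pattern extracts nothing is to test it with plain `re` (the sample pagination URL below is an assumption based on the pattern in the question, but the regex behaviour is the point):

```python
import re

# The pattern from the spider, exactly as passed to LinkExtractor's `allow`.
broken = r'http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=.*'
# The same pattern with the regex metacharacters '?' and '.' escaped.
fixed = r'http://www\.indeed\.com/jobs\?q=&l=19103&radius=50&fromage=30&start=.*'

# A plausible pagination URL (assumed here for illustration).
url = 'http://www.indeed.com/jobs?q=&l=19103&radius=50&fromage=30&start=10'

print(re.search(broken, url))  # None: 'jobs?' means 'job' plus an optional 's'
print(re.search(fixed, url))   # a match object
```

Once the escaped pattern matches in the shell, the same string will work as the `allow` argument.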

Granitosaurus

The XPath expressions inside the loop start with `//`, which searches the entire document rather than the current `sel` row, so every iteration re-selects every match on the page. Make them relative to the row with a `.//` prefix:

    jobs = sel.xpath('.//a[contains(@data-tn-element, "jobTitle")]/text()').extract()
    city = sel.xpath('.//span[@class="location"]/text()').extract()
    company = sel.xpath('.//span[@class="company"]/text()').extract()
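As a sanity check on relative paths, here is a minimal standard-library sketch (the markup is a simplified, hypothetical stand-in for Indeed's real HTML): `.//a` searches every descendant of the current row, while a bare `a` only matches direct children.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one search-results page (not Indeed's real markup).
page = ET.fromstring("""
<div>
  <div class="row">
    <h2><a data-tn-element="jobTitle">Data Analyst</a></h2>
    <span class="company">Acme Corp</span>
  </div>
</div>
""")

row = page.find('div')  # one result "row", analogous to `sel` in the spider

# './/a' searches all descendants of the row, so it finds the nested link
nested = row.findall('.//a')
# a bare 'a' only looks at direct children, and the link is inside <h2>
direct = row.findall('a')

print(len(nested), len(direct))
```

The same distinction applies in Scrapy's XPath selectors, which is why the relative `.//` form is the safe choice inside a per-row loop.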
Tracy