
I am scraping a website which returns a list of urls. Example: scrapy crawl xyz_spider -o urls.csv

It is working absolutely fine, but now I want it to create a new urls.csv on each run instead of appending data to the existing file. Is there any parameter I can pass to enable this?

Nikhil Parmar

3 Answers


Unfortunately, Scrapy can't do this at the moment.
There is a proposed enhancement on github though: https://github.com/scrapy/scrapy/issues/547
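
Update for future readers: that enhancement has since landed, as far as I know in Scrapy 2.4+, which added an uppercase -O flag that overwrites the output file instead of appending:

scrapy crawl myspider -O urls.csv

For older versions, the workarounds below still apply.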

However, you can easily redirect the output to stdout and then redirect that to a file:

scrapy crawl myspider -t json --nolog -o - > output.json

-o - means output to minus, and minus in this case means stdout.
You can also make an alias that deletes the file before running scrapy, something like:

alias sc='rm output.csv && scrapy crawl myspider -o output.csv'
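
Note that plain rm will fail if output.csv doesn't exist yet (e.g. on the very first run), and the && will then skip the crawl entirely, so rm -f is the safer choice:

alias sc='rm -f output.csv && scrapy crawl myspider -o output.csv'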
Granitosaurus
  • I have two spiders in my spider folder, and the file is being read by the other spider; when I run the above spider it creates a problem in the other spider and nothing executes – Nikhil Parmar Oct 30 '16 at 10:41
  • It's like executing one spider, but both spiders run; I don't know about this weird behaviour – Nikhil Parmar Oct 30 '16 at 10:41
  • Could you elaborate a bit more? You have two spiders that run concurrently, and one writes to urls.csv while the other one reads from it? Which approach are you trying to use to output the csv? – Granitosaurus Oct 30 '16 at 11:51
  • It's a bit of a different thing, but it's worth asking anyway: I have two spiders, a and b. I first run a and get output.csv; I then run spider b, which gets the urls to crawl from output.csv. I was trying to use the command rm output.csv && scrapy crawl a -o output.csv, but when I run a, spider b throws an error that it couldn't find output.csv – Nikhil Parmar Oct 30 '16 at 11:59
  • @NikhilParmar does `scrapy crawl a -o output.csv` produce anything by itself? – Granitosaurus Oct 30 '16 at 16:51

I usually tackle custom file exports by running Scrapy as a Python script and opening a file before calling up the spider class. This gives greater flexibility for handling and formatting your csv files, and even for running them as an extension to a web app or in the cloud. Something along the lines of the following:

import csv

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess()

    # 'w' mode truncates Output.csv, so every run starts with a fresh file
    with open('Output.csv', 'w', newline='') as output_file:
        mywriter = csv.writer(output_file)
        process.crawl(Spider_Class, start_urls=start_urls)  # Spider_Class: your spider
        process.start()  # blocks until the crawl finishes
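
The snippet above opens the file but never actually writes rows to it. A minimal sketch of one way to connect the two, hooking the writer up to Scrapy's item_scraped signal (MySpider and its url field are hypothetical stand-ins for your own spider):

import csv

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # hypothetical stand-in for your own spider
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url}

if __name__ == '__main__':
    process = CrawlerProcess()
    crawler = process.create_crawler(MySpider)

    # 'w' mode truncates, so every run starts from a fresh file
    with open('Output.csv', 'w', newline='') as output_file:
        mywriter = csv.writer(output_file)
        mywriter.writerow(['url'])  # header row

        # fired once per scraped item; writes it straight into the open file
        def write_item(item, response, spider):
            mywriter.writerow([item.get('url', '')])

        crawler.signals.connect(write_item, signal=signals.item_scraped)
        process.crawl(crawler)
        process.start()  # blocks until the crawl finishes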
quasarseeker

You can open the file in write mode and close it right away; that removes the content of the file.

import scrapy

from myproject.items import RestaurantDetailItem  # adjust to your project's items module


class RestaurantDetailSpider(scrapy.Spider):

    # read the links produced by RestaurantMainSpider (see the command below)
    urls = list(open('./restaurantsLink.csv'))
    urls = urls[1:]  # drop the csv header row
    print("Url List Found : " + str(len(urls)))

    # opening the file in 'w' mode and closing it right away truncates it,
    # so the old links are cleared once they have been read
    file = open('./restaurantsLink.csv', 'w')
    file.close()

    name = "RestaurantDetailSpider"
    start_urls = urls

    def safeStr(self, obj):
        try:
            if obj is None:
                return obj
            return str(obj)
        except UnicodeEncodeError:
            return obj.encode('utf8', 'ignore').decode('utf8')

    def parse(self, response):
        try:
            detail = RestaurantDetailItem()
            HEADING = self.safeStr(response.css('#HEADING::text').extract_first())
            if HEADING is not None:
                if ',' in HEADING:
                    HEADING = "'" + HEADING + "'"
                detail['Name'] = HEADING

            CONTACT_INFO = self.safeStr(response.css('.directContactInfo *::text').extract_first())
            if CONTACT_INFO is not None:
                if ',' in CONTACT_INFO:
                    CONTACT_INFO = "'" + CONTACT_INFO + "'"
                detail['Phone'] = CONTACT_INFO

            # .extract() always returns a list (possibly empty), so test truthiness
            ADDRESS_LIST = response.css('.headerBL .address *::text').extract()
            if ADDRESS_LIST:
                ADDRESS = ', '.join([self.safeStr(x) for x in ADDRESS_LIST])
                ADDRESS = ADDRESS.replace(',', '')
                detail['Address'] = ADDRESS

            EMAIL = self.safeStr(response.css('#RESTAURANT_DETAILS .detailsContent a::attr(href)').extract_first())
            if EMAIL is not None:
                EMAIL = EMAIL.replace('mailto:', '')
                detail['Email'] = EMAIL

            TYPE_LIST = response.css('.rating_and_popularity .header_links *::text').extract()
            if TYPE_LIST:
                TYPE = ', '.join([self.safeStr(x) for x in TYPE_LIST])
                TYPE = TYPE.replace(',', '')
                detail['Type'] = TYPE

            yield detail
        except Exception as e:
            print("Error occurred: " + str(e))
            yield None

Then run:

scrapy crawl RestaurantMainSpider -t csv -o restaurantsLink.csv

This will create the restaurantsLink.csv file, which I am using in my next spider, RestaurantDetailSpider.

So you can run the following command: it removes restaurantsLink.csv and creates a new one for the spider above to use, so the file is overwritten whenever you run the spider:

rm restaurantsLink.csv && scrapy crawl RestaurantMainSpider -o restaurantsLink.csv -t csv
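
If you'd rather keep the delete-and-recrawl step inside Python (for example on Windows, where rm isn't available), here is a minimal sketch of the same idea, using the filenames and spider name from this answer:

import os
import subprocess

# delete the old output first so the new crawl starts from a clean file
if os.path.exists('restaurantsLink.csv'):
    os.remove('restaurantsLink.csv')

# equivalent of: rm restaurantsLink.csv && scrapy crawl RestaurantMainSpider -o restaurantsLink.csv -t csv
subprocess.run(['scrapy', 'crawl', 'RestaurantMainSpider',
                '-o', 'restaurantsLink.csv', '-t', 'csv'], check=True)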
krishna chandak