10

Is there a way to overwrite said file instead of appending to it?

Example:

scrapy crawl myspider -o "/path/to/json/my.json" -t json    
scrapy crawl myspider -o "/path/to/json/my.json" -t json

This will append to the my.json file instead of overwriting it.

hooliooo

6 Answers

18
scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"
eLRuLL
  • Thank you! This is what I was looking for. So the simple "- >" part overwrites the file? – hooliooo Nov 05 '15 at 07:42
  • -o - redirects to standard output, and > redirects the standard output to the file at the given path. I used it and it worked weirdly; I got invalid JSON output. – miguelfg May 27 '16 at 11:16
  • Any idea why this doesn't work when I call it using subprocess.check_output, inside a Docker container? Command '['scrapy', 'crawl', 'spider_name', '-t', 'json', '--nolog', '-o', '-', '>', 'output.json', '-a', 'url=url.jpg]' returned non-zero exit status 2 (see the sketch after these comments) –  May 18 '17 at 09:27
  • See @Samuel Elh's answer: upgrade to Scrapy 2.4+ for the -O overwrite command-line option – phillipsK Dec 28 '20 at 18:23
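
Regarding the subprocess question in the comments above: the > redirection is shell syntax, so passing it as a list element to subprocess has no effect because there is no shell to interpret it. A minimal sketch of the same approach from Python, redirecting stdout explicitly; the output path and spider name are placeholders:

import subprocess

# '>' only works when a shell parses the command line; with an argument list
# there is no shell, so redirect stdout to the target file explicitly.
# "/path/to/json/my.json" and "myspider" are placeholders.
with open("/path/to/json/my.json", "wb") as out:
    subprocess.run(
        ["scrapy", "crawl", "myspider", "-t", "json", "--nolog", "-o", "-"],
        stdout=out,
        check=True,
    )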
9

There is a flag that overwrites the output file: pass the file via the -O option instead of -o, like this:

scrapy crawl myspider -O /path/to/json/my.json

More information:

$ scrapy crawl --help
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  append scraped items to the end of FILE (use - for
                        stdout)
--overwrite-output=FILE, -O FILE
                        dump scraped items into FILE, overwriting any existing
                        file
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
Ismail
  • That's true, however the `-O` option is very new. Not sure, but I think it was introduced in version 2.4.0, so it is probably not available for everybody yet. – Patrick Klein Nov 22 '20 at 01:02
7

To overcome this problem, I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in my project directory.

This is my customexport.py:

"""Custom Feed Exports extension."""
import os

from scrapy.extensions.feedexport import FileFeedStorage


class CustomFileFeedStorage(FileFeedStorage):
    """
    A File Feed Storage extension that overwrites existing files.

    See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79
    """

    def open(self, spider):
        """Return the opened file."""
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        # changed from 'ab' to 'wb' to truncate file when it exists
        return open(self.path, 'wb')

Then I added the following to my settings.py (see: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):

FEED_STORAGES_BASE = {
    '': 'myproject.customexport.CustomFileFeedStorage',
    'file': 'myproject.customexport.CustomFileFeedStorage',
}

Now the output file is overwritten on every run.

robkorv
  • nice solution. Is it a good idea to redefine FEED_STORAGES_BASE in your settings.py? In this case, `scrapy crawl ` commands would still be facing the issue. – hAcKnRoCk Jul 21 '17 at 09:06
  • I would call it `OverwriteFileFeedStorage`. – Suor Sep 23 '17 at 05:21
4

This is an old, well-known "problem" with Scrapy: every time you start a crawl and do not want to keep the results of previous runs, you have to delete the file yourself. The idea behind this is that you may want to crawl different sites, or the same site at different times, so accidentally losing your already gathered results would be bad.

A solution is to write your own item pipeline that opens the target file with 'w' instead of 'a'.

To see how to write such a pipeline look at the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically for JSON exports: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file)
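
For example, here is a minimal sketch of such a pipeline, loosely adapted from the JsonWriterPipeline example in those docs; the filename items.json is an assumption, and the file is opened with 'w' so it is truncated on every run:

import json


class JsonWriterPipeline(object):
    """Sketch of a pipeline that overwrites its output file on each run."""

    def open_spider(self, spider):
        # 'w' instead of 'a': output from any previous run is discarded
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # writes one JSON object per line (JSON Lines), as in the docs example
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Note that, as in the docs example, this produces one JSON object per line rather than a single JSON array, and the pipeline still has to be enabled via ITEM_PIPELINES in settings.py.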

GHajba
  • Can I do something similar with an exporter.py script where I instantiate a custom JsonItemExporter class with my edits? ( I'm still a novice programmer so I don't know if I'm saying it correctly) and then add self.file = open(file, 'wb')? I'm not sure if that's the correct way either – hooliooo Oct 15 '15 at 07:13
0

Since the accepted answer gave me problems with invalid JSON, this could work:

find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json"
miguelfg
0

Or you can add:

import os

if "filename.json" in os.listdir('..'):
    os.remove('../filename.json')

at the beginning of your code.

Very easy.

Omar Omeiri