I'm using django staticfiles + django-storages and Amazon S3 to host my data. All is working well except that every time I run manage.py collectstatic the command uploads all files to the server.

It looks like the management command compares timestamps from Storage.modified_time() which isn't implemented in the S3 storage from django-storages.
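As a rough illustration of that comparison (a simplified model, not Django's actual code; the function name `should_upload` is made up here, and `None` stands in for a storage backend that cannot report a modification time):

```python
from datetime import datetime


def should_upload(source_mtime, target_mtime):
    """Decide whether a local file needs to be copied to remote storage.

    target_mtime is None when the backend cannot report a modification
    time (hypothetical stand-in for an unimplemented modified_time()).
    """
    if target_mtime is None:
        # No timestamp available: the only safe choice is to re-upload.
        return True
    # Upload only when the local file is newer than the remote copy.
    return target_mtime < source_mtime


local = datetime(2012, 1, 10, 12, 0, 0)
remote_older = datetime(2012, 1, 9, 12, 0, 0)
remote_newer = datetime(2012, 1, 11, 12, 0, 0)
```

With no timestamp from the backend, every file falls into the first branch, which matches the "uploads everything every time" behaviour I'm seeing.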

How do you guys determine if an S3 file has been modified?

I could store file paths and last modified data in my database. Or is there an easy way to pull the last modified data from Amazon?

Another option: it looks like I can assign arbitrary metadata with python-boto where I could put the local modified date when I upload the first time.

Anyways, it seems like a common problem so I'd like to ask what solution others have used. Thanks!

Yuji 'Tomita' Tomita

2 Answers

The latest version of django-storages (1.1.3) handles file modification detection through S3 Boto.

pip install django-storages and you're good now :) Gotta love open source!

Update: set the AWS_PRELOAD_METADATA option to True in your settings file to get very fast syncs if you use the S3Boto class. If you use the S3 backend instead, use its PreloadedS3 class.
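A minimal settings sketch for the S3Boto case. Only `AWS_PRELOAD_METADATA` comes from the update above; the backend path and the bucket name are assumptions for illustration, so check them against the django-storages version you have installed:

```python
# settings.py (fragment)

# Assumed dotted path of the S3Boto backend; verify for your version.
STATICFILES_STORAGE = "storages.backends.s3boto.S3BotoStorage"

# Hypothetical bucket name, replace with your own.
AWS_STORAGE_BUCKET_NAME = "my-static-bucket"

# Load the bucket listing once up front instead of querying S3 per file.
AWS_PRELOAD_METADATA = True
```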


Update 2: It's still extremely slow to run the command.


Update 3: I forked the django-storages repository to fix the issue and added a pull request.

The problem is in the modified_time method, where the fallback value (a remote get_key lookup) is evaluated even when it isn't used, because Python evaluates the arguments to get before calling it. I moved the fallback into an if block so it runs only if get returns None.

    entry = self.entries.get(name, self.bucket.get_key(self._encode_name(name)))

should be:

    entry = self.entries.get(name)
    if entry is None:
        entry = self.bucket.get_key(self._encode_name(name))

With this change, 1000 requests take under 0.5s, down from about 100s.
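To see why the one-line version is slow, here is a small self-contained demonstration. `CountingBucket` is a hypothetical stand-in for the boto bucket that just counts lookups; it is not part of boto or django-storages:

```python
class CountingBucket:
    """Hypothetical stand-in for a remote bucket; counts get_key calls."""

    def __init__(self):
        self.calls = 0

    def get_key(self, name):
        # In the real backend this would be a network request to S3.
        self.calls += 1
        return "remote:" + name


# Preloaded metadata, as the entries cache would hold it.
entries = {"css/site.css": "cached:css/site.css"}

# Original form: the default argument is evaluated before get() runs,
# so the remote lookup happens even though the cache has the entry.
eager = CountingBucket()
entry = entries.get("css/site.css", eager.get_key("css/site.css"))

# Fixed form: the remote lookup happens only on a cache miss.
lazy = CountingBucket()
entry_lazy = entries.get("css/site.css")
if entry_lazy is None:
    entry_lazy = lazy.get_key("css/site.css")
```

Both forms return the cached entry, but `eager.calls` ends up at 1 while `lazy.calls` stays at 0. Multiply that wasted call by every static file and you get the slowdown.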


Update 4:

For syncing 10k+ files, I believe boto has to make multiple requests because S3 paginates results, which causes a 5-10 second sync time. This will only get worse as we get more files.

One possible solution is a custom management command, or a django-storages change, that stores a single file on S3 holding the metadata of all other files and updates it whenever collectstatic uploads a file.

It won't detect files uploaded by other means, but that won't matter if the management command is the sole entry point.
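A rough sketch of the manifest idea, under some assumptions: the manifest is a JSON document mapping each path to a content digest (I use MD5 here instead of a modification time, which is my own substitution), and `plan_uploads` is a made-up helper name. Fetching and storing the manifest on S3 is left out so the example runs on its own:

```python
import hashlib
import json


def file_digest(data: bytes) -> str:
    """Content fingerprint stored in the manifest for each file."""
    return hashlib.md5(data).hexdigest()


def plan_uploads(local_files, manifest_json):
    """Return (paths_to_upload, updated_manifest_json).

    local_files maps path -> file content; manifest_json is the manifest
    previously stored remotely, or None on the first run.
    """
    manifest = json.loads(manifest_json) if manifest_json else {}
    to_upload = []
    for path, data in sorted(local_files.items()):
        digest = file_digest(data)
        if manifest.get(path) != digest:
            # New or changed file: upload it and record the new digest.
            to_upload.append(path)
            manifest[path] = digest
    return to_upload, json.dumps(manifest, sort_keys=True)


local = {"css/site.css": b"body{}", "js/app.js": b"console.log(1)"}
first_run, manifest = plan_uploads(local, None)

local["js/app.js"] = b"console.log(2)"
second_run, manifest = plan_uploads(local, manifest)
```

On the first run both files are scheduled; on the second run only `js/app.js` is. The point is that a single manifest fetch would replace the paginated bucket listing, at the cost of trusting the manifest to be the only record of what was uploaded.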

Yuji 'Tomita' Tomita
  • How do you use the modified_time method? Running only ./manage.py collectstatic does not work for me. It uses the _save method from botos3 to save the files, but it does not check at any time if the file is new or not. What's your solution? – duduklein Feb 07 '12 at 13:03
  • This seems to be no longer true: python-dateutil >2.1 now supports both Python 2 and 3 in a shared codebase and python-dateutil==2.1 works fine for me with botos3. – Henrik Heimbuerger Jan 23 '13 at 07:52
  • Hey Yuji; I'm running into this same issue (really slow collectstatics with S3Boto with several thousand files). I'm wondering where you netted out on this. Could you summarize your current best recommendation(s) to optimize this process, since you've clearly spent a lot of time grappling with this issue? – B Robster May 29 '13 at 17:55
  • Update #3 solved the primary problem for me. AFAIK, the main repository has been fixed. That took down time from 100s to .5s for my load. The remaining issue is pagination time.. but it should be "acceptable" – Yuji 'Tomita' Tomita May 29 '13 at 18:45
  • `AWS_PRELOAD_METADATA` is now [deprecated](https://github.com/jschneier/django-storages/issues/293). If you have a S3 bucket with many files this setting would cause a server to attempt to load a list of all those files and slow down, or in my case, crash. – petroleyum Dec 07 '18 at 00:29

I answered the same question here: https://stackoverflow.com/a/17528513/1220706 . Check out https://github.com/FundedByMe/collectfast . It's a pluggable Django app that caches the ETag of remote S3 files and compares the cached checksum against the local file instead of performing a lookup every time. Follow the installation instructions and run collectstatic as normal. It took me from an average of around 1m30s to about 10s per deploy.
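The idea can be sketched like this. This is not collectfast's actual code; the function names and the plain dict used as a cache are assumptions for illustration. It relies on the fact that for objects uploaded in a single PUT (not multipart, not KMS-encrypted) the S3 ETag is the MD5 hex digest of the content, wrapped in double quotes:

```python
import hashlib


def local_etag(data: bytes) -> str:
    """Compute what S3 would report as the ETag for a single-part upload."""
    return '"%s"' % hashlib.md5(data).hexdigest()


def needs_upload(path, data, etag_cache):
    """Compare the cached remote ETag with the local file's checksum.

    etag_cache is a hypothetical dict of path -> ETag; a real setup
    would use a persistent cache rather than an in-memory dict.
    """
    return etag_cache.get(path) != local_etag(data)


# Pretend css/site.css was uploaded earlier and its ETag was cached.
cache = {"css/site.css": local_etag(b"body{}")}

unchanged = needs_upload("css/site.css", b"body{}", cache)
changed = needs_upload("css/site.css", b"body{color:red}", cache)
missing = needs_upload("js/app.js", b"console.log(1)", cache)
```

An unchanged file is skipped without touching S3, while changed or uncached files are uploaded.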

antonagestam