
I recently encountered a situation where my CouchDB instance used all available disk space on a 20GB VM instance. Upon investigation I discovered that a directory in /usr/local/var/lib/couchdb/ contained a number of .view files, the largest of which was 16GB. I was able to remove the *.view files to restore normal operation. I'm not sure why the .view files grew so large or how CouchDB manages them.

A bit more information: I have a VM running Ubuntu 9.10 (karmic) with 512MB of RAM and CouchDB 0.10. The VM has a cron job which invokes a Python script which queries a view. The cron job runs once every five minutes. Every time the view is queried, the size of a .view file increases. I've written a job to monitor this on an hourly basis, and after a few days I don't see the file rolling over or otherwise decreasing in size.

Does anyone have any insights into this issue? Is there a piece of documentation I've missed? I haven't been able to find anything on the subject, but that may be due to looking in the wrong places or using the wrong search terms.

4 Answers


CouchDB is very disk hungry, trading disk space for performance. Views will increase in size as items are added to them. You can recover disk space that is no longer needed with cleanup and compaction.

Every time you create, update, or delete a document, the view indexes will be updated with the relevant changes to the documents. The update to the view happens when it is queried. So if you are making lots of document changes, you should expect your indexes to grow, and they will need to be managed with compaction and cleanup.
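
Both operations are triggered over HTTP. A minimal sketch, assuming a database named yourdb on the default port (these endpoints are the ones documented for 1.x-era CouchDB; very old releases may lack _view_cleanup):

# Compact the database file itself, dropping old document revisions
curl -X POST -H "Content-Type: application/json" http://127.0.0.1:5984/yourdb/_compact

# Remove index files for views that no longer exist in any design document
curl -X POST -H "Content-Type: application/json" http://127.0.0.1:5984/yourdb/_view_cleanup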

If your views are very large for a given set of documents, you may have poorly designed views. Alternatively, your design may just require large views, and you will need to manage that as you would any other resource.

It would be easier to tell what is happening if you could describe what document updates (including creates and deletes) are happening and what your view functions are emitting, especially for the large view.

Kerr
  • Documents are large and changes to documents are significant. This all makes sense. Thank you for your answer. But doesn't CouchDB clean up after itself? Or is this left up to the administrator? That seems broken, or am I missing something? – Carlos Justiniano Aug 18 '10 at 02:03
  • CouchDB requires that you run compaction to recover disk space. When that can be done is highly dependent on your environment. Typically you would do this when the load on the server is low, triggering it with a cron job (see the sketch after these comments). If you have any replicas you should also understand how it may affect replication. – Kerr Aug 18 '10 at 09:42
  • I disagree with "if your views are very large for a given set of documents then you may have poorly designed views". The "may" is there, but the author should stress that a small view is not necessarily fast for the application. E.g. an option like `?include_docs` is very expensive, which can make including full documents in the view necessary for performance. This is again where CouchDB trades disk space for performance. – Till May 08 '11 at 19:02
  • 1
    Well, the very next sentence states that the the app design may just require large views. How much more explicit do you need it? Remember that this is an answer to a question about runaway disk usage in views. It's certainly easy to design a view that creates unnecessarily large indexes if you don't know what your doing. So I think the answer stands as it is. – Kerr May 09 '11 at 14:52
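
As a sketch of the cron approach mentioned in the comments above (the schedule and database name are placeholders; pick a low-traffic window for your own setup):

# Hypothetical crontab entry: compact the database every night at 03:00
0 3 * * * curl -s -X POST -H "Content-Type: application/json" http://127.0.0.1:5984/yourdb/_compact > /dev/null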

Your .view files grow each time you access a view because CouchDB updates views on access. CouchDB views need compaction just like databases do. If you have frequent changes to your documents, resulting in changes to your view, you should run view compaction from time to time. See http://wiki.apache.org/couchdb/HTTP_view_API#View_Compaction
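
View compaction is triggered per design document. A minimal sketch, assuming the views live in a (hypothetical) design document named _design/app in a database named yourdb:

# Compact only the view index files belonging to _design/app
curl -X POST -H "Content-Type: application/json" http://127.0.0.1:5984/yourdb/_compact/app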

To reduce the size of your views, have a look at the data you are emitting. When you emit(foo, doc), the entire document is copied into the view so that it is instantly available when you query the view. A map function like function(doc) { emit(doc.title, doc); } will result in a view as big as the database itself. You could instead emit(doc.title, null); and use the include_docs option to let CouchDB fetch the document from the database when you access the view (which incurs a slight performance penalty). See http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
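
For example, with the lean emit(doc.title, null) variant, the documents can be joined in at query time. A minimal sketch, assuming a (hypothetical) design document _design/app with a view named by_title:

# Query the view and have CouchDB attach each full document to its row
curl 'http://127.0.0.1:5984/yourdb/_design/app/_view/by_title?include_docs=true'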

tisba

Use sequential or monotonic ids for documents instead of random ones

Yes, CouchDB is very disk hungry, and it needs regular compaction. But there is another thing that can help reduce this disk usage, especially when some of it is unnecessary.

CouchDB uses B+ trees for storing data/documents, which is a very good data structure for data-retrieval performance. However, the B+ tree trades disk space for that performance. With completely random ids, the B+ tree fans out quickly: since the minimum fill rate is 1/2 for every internal node, the nodes end up filled to only about 1/2 (because random ids spread insertions evenly across the tree), generating more internal nodes. New insertions that land all over the tree also cause large parts of it to be rewritten, since CouchDB's storage file is append-only and every update copies the path from the changed leaf up to the root. That's what randomness can cause ;)

Using sequential or monotonic ids instead avoids most of this.
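
One way to get this without changing your application code is to have the server hand out sequential UUIDs. A minimal sketch, assuming a 1.x-era CouchDB (the /_config endpoint moved in later releases; on versions without this setting you can assign monotonically increasing _ids client-side instead):

# Switch server-generated document ids from fully random to mostly sequential
curl -X PUT http://127.0.0.1:5984/_config/uuids/algorithm -d '"sequential"'

# Documents created via POST now get increasing ids, keeping B+ tree nodes densely packed
curl -X POST -H "Content-Type: application/json" -d '{"title": "example"}' http://127.0.0.1:5984/yourdb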

0xc0de

I've had this problem too, trying out CouchDB for a browser-based game.

We had about 100,000 unexpected visitors on the first day of the site launch, and within two days the CouchDB database was taking up about 40GB of space. This made the server crash because the hard drive was completely full.

Compaction brought that back to about 50MB. I also set _revs_limit (which defaults to 1000) to 10, since we didn't care about revision history, and it's been running perfectly since. After almost 1M users, the database size is usually about 2-3GB. When I run compaction it drops to about 500MB.

Setting document revision limit to 10:
curl -X PUT -d "10" http://dbuser:dbpassword@127.0.0.1:5984/yourdb/_revs_limit

Or without user:password (not recommended):
curl -X PUT -d "10" http://127.0.0.1:5984/yourdb/_revs_limit
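
To check whether compaction is paying off, you can read the database's reported size before and after (field names vary a bit by version; on 1.x the info document includes disk_size and compact_running). Again assuming a database named yourdb:

# Database info: disk_size is bytes on disk; compact_running tells you when compaction has finished
curl http://127.0.0.1:5984/yourdb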

woudsma