
I was wondering why my CouchDB database was growing so fast, so I wrote a little test script. The script changes an attribute of a CouchDB document 1200 times and records the size of the database after each change. After these 1200 writes the database is compacted and the size is measured once more. Finally, the script plots the database size against the revision numbers (a rough sketch of the benchmark follows the list below). The benchmark is run twice:

  • The first time, the default number of document revisions (_revs_limit = 1000) is used.
  • The second time, the number of document revisions (_revs_limit) is set to 1.
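
For reference, a minimal sketch of such a benchmark, assuming CouchDB 1.x on localhost:5984, the Python `requests` library, and a hypothetical database/document name (the `disk_size` field and some endpoints differ in newer CouchDB versions):

```python
import time
import requests

BASE = "http://localhost:5984"   # assumed CouchDB location
DB = "revs_benchmark"            # hypothetical database name

def db_size():
    # CouchDB 1.x reports the database file size as "disk_size" in GET /db
    return requests.get(f"{BASE}/{DB}").json()["disk_size"]

def run(revs_limit, steps=1200):
    requests.delete(f"{BASE}/{DB}")   # start from a clean database
    requests.put(f"{BASE}/{DB}")
    requests.put(f"{BASE}/{DB}/_revs_limit", data=str(revs_limit))

    doc, rev, sizes = {"counter": 0}, None, []
    for i in range(steps):
        doc["counter"] = i
        if rev:
            doc["_rev"] = rev         # update the existing document
        rev = requests.put(f"{BASE}/{DB}/testdoc", json=doc).json()["rev"]
        sizes.append(db_size())

    # compaction runs asynchronously, so wait before the final measurement
    requests.post(f"{BASE}/{DB}/_compact",
                  headers={"Content-Type": "application/json"})
    time.sleep(10)
    sizes.append(db_size())
    return sizes

sizes_default = run(1000)   # first run: default _revs_limit
sizes_limited = run(1)      # second run: _revs_limit = 1
```

Plotting the recorded sizes against the write index should reproduce graphs like the ones below.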

The first run produces the following plot:

[Plot: database size vs. revision number, first run]

The second run produces this plot:

[Plot: database size vs. revision number, second run]

For me this is quite unexpected behavior. In the first run I would have expected linear growth, since every change produces a new revision. Once the 1000-revision limit is reached, the size should stay constant because older revisions are discarded, and after compaction the size should drop significantly.

In the second run, the first revision should produce a certain database size that is then kept constant during the following writes, since every new revision leads to the deletion of the previous one.

I could understand a little overhead being needed to manage the changes, but this growth behavior seems weird to me. Can anybody explain this phenomenon, or correct the assumptions that led to my wrong expectations?

konrad
  • I figured out that compaction takes a few seconds, while my script recorded the size of the database immediately after requesting compaction. I added some waiting time to my script, and now the drop in size after compaction is recorded as expected. This doesn't change much about the main problem (the rapid growth) but should be noted here. – konrad May 28 '10 at 14:58
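
Rather than sleeping for a fixed time, one could poll the database's `compact_running` flag (reported by GET /db in CouchDB 1.x) until compaction finishes; a minimal sketch, assuming the same `requests`-based setup as above:

```python
import time
import requests

def wait_for_compaction(base, db, poll_interval=0.5):
    """Trigger compaction and block until CouchDB reports it has finished."""
    # compaction is asynchronous in CouchDB
    requests.post(f"{base}/{db}/_compact",
                  headers={"Content-Type": "application/json"})
    # GET /db includes a boolean "compact_running" field (CouchDB 1.x)
    while requests.get(f"{base}/{db}").json().get("compact_running"):
        time.sleep(poll_interval)
```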

1 Answer


First off, CouchDB saves some information even for deleted revisions (just the ID and revision identifier), because it needs this for replication purposes.

Second, inserting documents one at a time is suboptimal because of the way the data is saved on disk: CouchDB appends each update to the database file rather than overwriting in place (see Wikipedia). This could explain the superlinear growth in the first graph.
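
As an illustration of the first point, the retained revision metadata can be inspected through the document API; a minimal sketch, again assuming a local CouchDB and the Python `requests` library (the database and document names are hypothetical):

```python
import requests

BASE = "http://localhost:5984"
DB = "revs_benchmark"   # hypothetical database name

# ?revs_info=true asks CouchDB to list the revision history it still knows
# about for this document, including revisions whose bodies are gone.
resp = requests.get(f"{BASE}/{DB}/testdoc", params={"revs_info": "true"})
for entry in resp.json().get("_revs_info", []):
    # "available" means the revision body is still stored; "missing" or
    # "deleted" means only the revision identifier is kept (e.g. for replication).
    print(entry["rev"], entry["status"])
```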

djc
  • Thanks for your response! While I agree that such metadata is generated, I doubt that this would cause the database to grow to 30 MB in the first run. – konrad May 28 '10 at 14:22
  • I asked the question on the CouchDB mailing list. There, somebody also assumes that this underlying data handling is the reason for the growth (http://bit.ly/abPCzQ). So it looks like you were right. Thanks again! – konrad Jun 02 '10 at 09:27