
I have a program that uses RocksDB and writes a large number of key-value pairs to a database:

#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace rocksdb;

// Database directory (the original snippet left kDBPath undefined).
const std::string kDBPath = "/tmp/rocksdb_example";

int main() {
    DB* db;
    Options options;
    // Optimize RocksDB. This is the easiest way to get RocksDB to perform well.
    options.IncreaseParallelism(12);
    options.OptimizeLevelStyleCompaction();
    // Create the DB if it's not already present.
    options.create_if_missing = true;
    // Open the DB.
    Status s = DB::Open(options, kDBPath, &db);
    assert(s.ok());

    // Write one million key-value pairs, all with the same value.
    for (int i = 0; i < 1000000; i++) {
        s = db->Put(WriteOptions(), "key" + std::to_string(i), "a hard-coded string here");
        assert(s.ok());
    }

    delete db;
    return 0;
}

When I ran the program for the first time, it generated about 2GB of database files. I then ran the same program several times, without any changes, and ended up with roughly N*2GB of database, where N is the number of runs. Only after a certain number of runs did the database size start to shrink. What I expected was that, since the batch of key-value pairs is identical on every run, each run would overwrite the previous data, so the database size should stay around 2GB.

Question: is this an issue in RocksDB, and if not, what are the proper settings to keep the database size stable when the same key-value pairs are written repeatedly?

asked by HuyLuyen

1 Answer


A full compaction can reduce the space usage; just add this line before delete db;:

db->CompactRange(CompactRangeOptions(), nullptr, nullptr);

Note: a full compaction does take some time, depending on the data size.
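
For example, here is a sketch of how the end of the program above might look with the compaction added. The bottommost_level_compaction setting is an optional extra beyond the one-liner, not something required; it asks RocksDB to rewrite the lowest level as well, which is where stale versions from earlier runs tend to accumulate:

// Merge away stale versions of overwritten keys before closing.
CompactRangeOptions cro;
// Optionally force the bottommost level to be rewritten too;
// by default RocksDB may skip it when there is nothing below to merge into.
cro.bottommost_level_compaction = BottommostLevelCompaction::kForce;
s = db->CompactRange(cro, nullptr, nullptr);  // nullptr, nullptr = entire key range
assert(s.ok());

delete db;
return 0;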

Space amplification is expected: in an LSM tree, a Put never overwrites the old value in place. The new value goes into a fresh file, and stale versions stay behind in older SST files until a compaction merges them away, which is why repeated runs kept growing the database. All LSM-tree-based databases have this issue: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#amplification-factors
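
If you want to watch this happen, one option (a sketch reusing the db handle from the question; rocksdb.total-sst-files-size is a standard RocksDB property) is to print the total SST file size before and after the compaction:

#include <iostream>

// Total bytes of all SST files, reported as a decimal string.
std::string sst_size;
db->GetProperty("rocksdb.total-sst-files-size", &sst_size);
std::cout << "SST size before compaction: " << sst_size << " bytes\n";

db->CompactRange(CompactRangeOptions(), nullptr, nullptr);

db->GetProperty("rocksdb.total-sst-files-size", &sst_size);
std::cout << "SST size after compaction: " << sst_size << " bytes\n";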

Here is a great paper on space amplification research for RocksDB: http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf

answered by Jay Zhuang