
I have a program that uses RocksDB and writes a large number of key-value pairs to a database:

#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace rocksdb;

// Database directory (the original snippet left kDBPath undefined).
const std::string kDBPath = "/tmp/rocksdb_example";

int main() {
    DB* db;
    Options options;
    // Optimize RocksDB. This is the easiest way to get RocksDB to perform well.
    options.IncreaseParallelism(12);
    options.OptimizeLevelStyleCompaction();
    // Create the DB if it's not already present.
    options.create_if_missing = true;
    // Open the DB.
    Status s = DB::Open(options, kDBPath, &db);
    assert(s.ok());

    // Write one million key-value pairs, all with the same value.
    for (int i = 0; i < 1000000; i++) {
        s = db->Put(WriteOptions(), "key" + std::to_string(i), "a hard-coded string here");
        assert(s.ok());
    }

    delete db;
    return 0;
}

When I ran the program for the first time, it generated about 2GB of database files. I then ran the same program several times, without any changes, and ended up with roughly N*2GB of database, where N is the number of runs. Only after a certain number of runs did the database size start to shrink. What I expected was that, since the batch of key-value pairs is identical on every run, each run would overwrite the previous data, so the database size should stay around 2GB.

Question: is this an issue in RocksDB, and if not, what are the proper settings to keep the database size stable when the same key-value pairs are written repeatedly?

asked by HuyLuyen

1 Answer


A full compaction can reduce the space usage; just add this line before delete db;:

db->CompactRange(CompactRangeOptions(), nullptr, nullptr);

Note: a full compaction does take some time, depending on the data size.
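
For example, here is a sketch of how the end of the program above might look with the compaction added. The bottommost_level_compaction setting is an optional extra beyond the one-liner, not something required; it asks RocksDB to rewrite the lowest level as well, which is where stale versions from earlier runs tend to accumulate:

// Merge away stale versions of overwritten keys before closing.
CompactRangeOptions cro;
// Optionally force the bottommost level to be rewritten too;
// by default RocksDB may skip it when there is nothing below to merge into.
cro.bottommost_level_compaction = BottommostLevelCompaction::kForce;
s = db->CompactRange(cro, nullptr, nullptr);  // nullptr, nullptr = entire key range
assert(s.ok());

delete db;
return 0;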

Space amplification is expected: in an LSM tree, a Put never overwrites the old value in place. The new value goes into a fresh file, and stale versions stay behind in older SST files until a compaction merges them away, which is why repeated runs kept growing the database. All LSM-tree-based databases have this issue: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#amplification-factors
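
If you want to watch this happen, one option (a sketch reusing the db handle from the question; rocksdb.total-sst-files-size is a standard RocksDB property) is to print the total SST file size before and after the compaction:

#include <iostream>

// Total bytes of all SST files, reported as a decimal string.
std::string sst_size;
db->GetProperty("rocksdb.total-sst-files-size", &sst_size);
std::cout << "SST size before compaction: " << sst_size << " bytes\n";

db->CompactRange(CompactRangeOptions(), nullptr, nullptr);

db->GetProperty("rocksdb.total-sst-files-size", &sst_size);
std::cout << "SST size after compaction: " << sst_size << " bytes\n";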

Here is a great paper on space amplification research for RocksDB: http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf

answered by Jay Zhuang