I'm currently playing around with RocksDB (C++) and was curious about some performance metrics I've experienced.
For testing purposes, my database keys are file paths and the values are filenames. My database has around 2M entries in it. I'm running RocksDB locally on a MacBook Pro 2016 (SSD).
My use case is dominated by reads. Full key scans are quite common as are key scans that include a "significant" number of keys. (50%+)
I'm curious about the following observations:
1. An Iterator
is dramatically faster than calling Get
when performing full key scans.
When I want to look at all of the keys in the database, I'm seeing a 4-8x performance improvement when using an Iterator
instead of calling Get
for each key. The use of MultiGet
makes no difference.
In the case of calling Get
roughly 2M times, the keys have been previously fetched into a vector and sorted lexicographically. Why is calling Get
repeatedly so much slower than using an Iterator
? Is there a way to narrow the performance gap between the two APIs?
2. When fetching around half the keys, the performance between using an Iterator
and Get
starts to become negligible.
As the number of keys to fetch is reduced, then making multiple calls to Get
starts to take about as long as using an Iterator
as the iterator is paying the price of scanning over keys that aren't in the desired keyset.
Is there some "magic" ratio where this becomes true for most databases? For example, if I need to scan over 25% of the keys, then calling Get
is faster, but if it's 75% of the keys, then an Iterator
is faster. But those numbers are just "made up" by rough testing.
3. Fetching keys in sorted order does not appear to improve performance.
If I pre-sort the keys I want to fetch into the same order that an Iterator
would return them in, that does not appear to make calling Get
multiple times any faster. Why is that? It's mentioned in the documentation that it's recommended to sort keys before doing a batch insert. Does Get
not benefit from the same look-ahead caching that an Iterator
benefits from?
4. What settings are recommended for a read-heavy use case?
Finally, are there any specific settings recommended for a read-heavy use case that might involve scanning a significant number of keys at once?
macOS 10.14.3, MacBook Pro 2016 SSD, RocksDB 5.18.3, Xcode 10.1