
I am working on an application where we write lots and lots of key-value pairs. In production the database size will run into hundreds of terabytes, even multiple petabytes. The keys are 20 bytes and the values are at most 128 KB, very rarely smaller than 4 KB. Right now we are using MongoDB. The performance is not very good, because obviously there is a lot of overhead going on here: MongoDB writes to the file system, which writes to the LVM, which in turn writes to a RAID 6 array.

Since our requirement is very basic, I think using a general-purpose database system is hurting the performance. I was thinking of implementing a simple database system where we would put the documents (or 'values') directly onto the raw drive (actually the RAID array), and store the keys (with a pointer to where each value lives on the raw drive) in a fast in-memory database backed by an SSD. This would also speed up reads, as there would be no fragmentation (as opposed to using a filesystem).
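A minimal sketch of the design I have in mind (all names here are illustrative, not an existing system; a regular file stands in for the raw block device, but `os.open` on a `/dev/sdX` node would work the same way):

```python
import os

class RawStore:
    """Sketch: values written sequentially to a raw device,
    keys kept in an in-memory index (dict here; in production this
    would be an SSD-backed in-memory database)."""

    def __init__(self, device_path):
        # On production this would be the block device node, e.g. /dev/sdb
        self.fd = os.open(device_path, os.O_RDWR | os.O_CREAT)
        self.index = {}       # 20-byte key -> (offset, length)
        self.write_head = 0   # next free byte on the device

    def put(self, key, value):
        offset = self.write_head
        # positional write straight to the device, no filesystem involved
        os.pwrite(self.fd, value, offset)
        self.index[key] = (offset, len(value))
        self.write_head += len(value)

    def get(self, key):
        offset, length = self.index[key]
        return os.pread(self.fd, length, offset)
```

Since values are written append-only and located by exact offset, a read is a single positional `pread` with no directory lookups or filesystem metadata in the path.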

Although a document is rarely deleted, we would still have to maintain a pool of the free space available on the device (something that the filesystem would otherwise have provided).
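The free-space pool could be something as simple as a first-fit extent list (again just an illustrative sketch, not a finished design; coalescing of adjacent extents is omitted):

```python
class FreePool:
    """Tracks free space on the device as (offset, length) extents."""

    def __init__(self, device_size):
        self.free = [(0, device_size)]   # one big free extent to start

    def allocate(self, length):
        # first-fit: take the first extent large enough
        for i, (off, size) in enumerate(self.free):
            if size >= length:
                if size == length:
                    del self.free[i]          # extent fully consumed
                else:
                    self.free[i] = (off + length, size - length)
                return off
        raise MemoryError("no free extent large enough")

    def release(self, offset, length):
        # deletions are rare, so simply append; merging adjacent
        # extents could be done periodically
        self.free.append((offset, length))
```

Because deletions are rare, fragmentation of the free list would grow slowly, and an occasional merge pass over the extent list would keep it compact.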

My question is: will this really provide any significant improvement? Also, are there any document storage systems that do something like this, or anything similar that we could use as a starting point?

Tarandeep Gill
  • I don't think it is possible to store data in a binary-accessible form without some kind of file system to interact with the OS of your choice – Sammaye Mar 20 '13 at 13:39
  • Sure you can access a drive or a raid array as a "block device" in Linux and read/write to it directly. – Tarandeep Gill Mar 20 '13 at 13:55
  • A block device can have a file system. I think you will find it comes initialised with a EXT or FAT or whatever file system before you attach the device – Sammaye Mar 20 '13 at 13:56
  • A block device is merely the type of device being connected, it does not denote whether or not it has a file system – Sammaye Mar 20 '13 at 13:57
  • 2
    A block device does not need to have a file system, in order to read or write to it. – Tarandeep Gill Mar 20 '13 at 14:01
  • If you want the drive to be accessible by an OS's own file system, such that you are accessing the data on the block device from Windows or Linux, you will probably (99.99% sure) find that they require you to format the drive before you can read/write to it. – Sammaye Mar 20 '13 at 14:19
  • I will have to contradict you here. You are wrong. Try this on a Linux machine. Add a new drive (let's assume it's `/dev/sdb`). Assume there is a file `test.txt` with contents `hello world` in your home directory. Give the following commands: `dd if=~/test.txt of=/dev/sdb bs=11 count=1` `dd if=/dev/sdb of=output.txt bs=11 count=1`. These commands write the file to the raw drive, then read it back into another file `output.txt`. If you read the output file, its contents will be `hello world`. **Please note** this will destroy the filesystem of the drive, if there was any. – Tarandeep Gill Mar 20 '13 at 14:26
  • Hmm, I am gonna need to look into that more, since sources on `dd` don't make it clear exactly how it writes. However, they do mention that its purpose is to `convert and copy a file` and that `On Unix, device drivers for hardware (such as hard disks) and special device files (such as /dev/zero and /dev/random) appear in the file system just like normal files`, so I am unsure about your claim, but I will investigate ( http://en.wikipedia.org/wiki/Dd_%28Unix%29 ). – Sammaye Mar 20 '13 at 14:35
  • Block devices appear as a file, yes, but that does not mean that the block devices have a filesystem. – Tarandeep Gill Mar 20 '13 at 14:54
  • At the same time for a disk to be mounted I believe it actually requires a file system in the first place. I need to double check it a sec – Sammaye Mar 20 '13 at 14:56
  • Again, a disk has to have a file system to be mounted, yes, BUT it doesn't need to have one to be written or read to! You don't need to mount a disk to read/write to it. You only need to mount if you need to write files to its file system. – Tarandeep Gill Mar 20 '13 at 15:27
  • Wait so using your dd example how do you write to a disk that isn't mounted? – Sammaye Mar 20 '13 at 15:31
  • You don't have to mount a disk in order to write to it. Are you familiar with Linux? Any new device added to the system shows up as a block device under `/dev`. You can read/write to it directly without mounting. – Tarandeep Gill Mar 20 '13 at 15:50
  • Yes I am. Hmmm, I will need to check that in a min, I can understand your example of dding it cos dd will initialise the disk and then write to it but as you noted: "Please note this will destroy the filesystem of the drive, if there was any.", all the linux distros I have used needed a disk to be initialised before it would allow read/writing, aka mounting – Sammaye Mar 20 '13 at 15:55

2 Answers


Apache Cassandra jumps to mind. It's the current go-to NoSQL solution where massive scaling is concerned, and it sees production usage at several large companies with massive scaling requirements. Having worked a little with it, I can say that it takes some time to rethink your data model to fit how it arranges its storage engine. The famously cited article "WTF is a supercolumn" gives a sound introduction to this. Caveat: Cassandra really only makes sense when you plan on storing huge datasets and when distribution with no single point of failure is a mission-critical requirement. With the way you've described your data, it sounds like a fit.

Also, have you looked into Redis at all, at least for storing the key references? Your memory requirements far outstrip what a single instance can handle, but Redis can also be configured to shard. That isn't its primary use case, but it sees production use at both Craigslist and Groupon.

Also, have you done everything possible to optimize mongo, especially investigating how you could improve indexing? Mongo does save out to disk, but should be relatively performant when optimized to keep the hottest portion of the set in memory if able.

Is it possible to cache this data if it's not too transient?

I would strongly caution you against rolling your own here. Just a fair warning. That's not a knock at you or anyone else; it's just that I've personally had to maintain custom "data indexes" written by in-house developers who got in way over their heads. At my job we have a massive on-disk key-value store, written by a developer who has since left the company, that is a major performance bottleneck in our system. It's frustrating to be stuck with such a solution amid the exciting NoSQL opportunities of today. Projects like the ones I cited above take advantage of the whole strength of the open-source community to prove and optimize their implementations. That isn't something you will be able to attain working on your own solution unless you make a massive investment of time, effort, and promotion. At the very least I'd encourage you to look at all your NoSQL options and maybe find a project you can contribute to rather than rolling your own. Writing a database server is definitely a nontrivial task that needs a sizable team, especially with the requirements you've given (but should you end up doing so, I wish you luck! =) )

DeaconDesperado
  • Actually, the main point I missed in the question is that we are looking for something that replicates data across independent nodes like a RAID 6. All of the NoSQL implementations I have researched have a replication model where you copy the whole value to more than one node. We are looking for something that lets you store a value across, say, 10 nodes, of which 8 are required to recover the data. So that's what my implementation was going to do: use erasure codes for fault tolerance, instead of whole-document replication. – Tarandeep Gill Mar 20 '13 at 13:45
  • I know you're considering ditching mongo, but it does have this requirement in the form of write preference settings in the driver. You can specify what the replication factor for an individual write would be in order to be considered "successful" http://emptysquare.net/blog/pymongos-new-default-safe-writes/ – DeaconDesperado Mar 20 '13 at 13:53
  • Replication in Mongo is "copy all of the data onto another node". We are looking for something like "divide the data into 8 chunks, calculate two parity chunks, and store each of the 10 chunks on a different node". Tahoe-LAFS uses this approach. – Tarandeep Gill Mar 20 '13 at 14:00
  • Thanks for clearing that up. Now that you've detailed it a little more, I can't think of anything in the NoSQL sphere offhand that offers individual entry-level division like that except GridFS (which does chunk the data but offers no control over which chunks go where), and that is impractical for you for other reasons. Wish I could be of more help. – DeaconDesperado Mar 20 '13 at 14:06
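The erasure-coding idea from the comments above can be illustrated with a much-simplified sketch. The asker's 8-data + 2-parity scheme needs Reed-Solomon coding over a Galois field; the version below uses a single XOR parity chunk (so it survives the loss of any one chunk, not two), purely to show the mechanics:

```python
def encode(value, n_chunks):
    """Split value into n_chunks data chunks plus one XOR parity chunk.
    Any single missing chunk can be rebuilt from the others."""
    size = -(-len(value) // n_chunks)            # ceiling division
    padded = value.ljust(size * n_chunks, b'\0') # zero-pad to a multiple
    chunks = [padded[i * size:(i + 1) * size] for i in range(n_chunks)]
    parity = bytearray(size)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return chunks + [bytes(parity)]

def recover(stored, missing_index):
    """Rebuild the chunk at missing_index (stored[missing_index] is None)
    by XOR-ing all surviving chunks, data and parity alike."""
    size = len(next(c for c in stored if c is not None))
    out = bytearray(size)
    for idx, chunk in enumerate(stored):
        if idx != missing_index:
            for i, b in enumerate(chunk):
                out[i] ^= b
    return bytes(out)
```

Each chunk would live on a different node; losing one node costs nothing, since its chunk is the XOR of the rest. The real 8-of-10 scheme generalizes this with Reed-Solomon, tolerating any two losses at a 1.25x storage overhead instead of the 2x (or more) of whole-document replication.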

Late answer, but for future reference, I think Spider does this.

Evan Langlois