
Having grown out of Redis for some data structures I'm looking to other solutions with good disk/SSD performance. I recently discovered Aerospike which seems to excel in an SSD environment.

One of the most memory-hungry structures is a collection of about 100,000 Redis sets, each of which can contain up to 10,000 strings. Each string is between 10 and 30 characters.

These sets are mostly used for exists / uniqueness checks.
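
For context, the membership-check workload described above looks roughly like this (a minimal sketch that mimics Redis SADD/SISMEMBER semantics with plain Python sets; the names `partition_sets`, `add`, and `contains` are illustrative only, not Redis or Aerospike API):

```python
# Mimic the described Redis usage with plain Python sets.
# In Redis this would be SADD / SISMEMBER against ~100,000 sets.

partition_sets = {}  # one entry per "Redis set"

def add(set_name, value):
    """SADD equivalent: returns True if the value was newly added."""
    members = partition_sets.setdefault(set_name, set())
    if value in members:
        return False
    members.add(value)
    return True

def contains(set_name, value):
    """SISMEMBER equivalent: the exists / uniqueness check."""
    return value in partition_sets.get(set_name, set())
```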

What would be the best way to model these? I generally see 2 options:

  • model each Redis set as an Aerospike lset
  • model each value in a set separately

Besides this choice, the 100,000 Redis sets also act as a partitioning of the keys. For reasons of locality it would probably make sense to have a similar kind of partitioning/namespacing in Aerospike. However, I'm pretty sure the notion of 'namespacing' in Aerospike isn't meant for this sort of key partitioning. What would be a correct way (if any) to do this in Aerospike, or is it not needed?

Geert-Jan

2 Answers


Aerospike does its own partitioning for load balancing and high availability. A namespace is synonymous with a database in the traditional sense, NOT with a partition of the data. Data in a namespace is automatically partitioned and distributed across the cluster, so as a user you need not worry about data placement.

I would map each Redis set to an Aerospike "lset" (one to one). Aerospike takes care of data locality for the data in a given "lset".

DB Guy

Yes, you should not worry about data locality, as Aerospike does auto-sharding. This ensures an even distribution of data and of read/write load across all nodes of the cluster.

Putting the data in an lset has its advantages: it gives you functionality similar to Redis, so you do not need to write your own. But it has its disadvantages too, so you should choose based on your requirements. All operations on a single lset are serialized; if you expect reads/writes to a set to be parallelized, lset may not be the right fit for you. Also, the exists check on an lset actually reads the full record before returning true/false, whereas Aerospike's exists API for normal keys answers from the in-memory index, which is much faster.

For this use case, you may not be able to segregate them into Aerospike's native 'sets': you need 100,000 sets, but as of now Aerospike only supports 1024 sets.

Let me add a third option to your list. You can model the key itself to create virtual sets for you as below:

  • if your actual key is key1 and you want it to go to set1, set your mashed key to set1_key1.
  • when you want to check whether key7 exists in set5, check for the existence of set5_key7.
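
The key-composition scheme in the bullets above can be sketched as follows (a minimal illustration; the helper names are invented, and a plain dict stands in for a real Aerospike client, which would need a live server):

```python
def virtual_set_key(set_name: str, key: str) -> str:
    """Compose the mashed key: set name + '_' + key, as in set1_key1."""
    return f"{set_name}_{key}"

# Simulated key-value store in place of an Aerospike namespace.
store = {}

def put(set_name, key, value):
    store[virtual_set_key(set_name, key)] = value

def exists(set_name, key):
    # With a real client this would be a single exists() call on the
    # composed key, answered from the in-memory primary index.
    return virtual_set_key(set_name, key) in store
```

Checking whether key7 exists in set5 then becomes a single lookup on the composed key set5_key7.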

If you go with this model, you exploit Aerospike's data distribution and load balancing to the fullest. The exists check will be the fastest, as it requires no I/O.

sunil
  • thanks. Going with your 3rd option, this would actually be just dumping every set-value pair (100,000 * 10,000 = 1,000,000,000), correct? Just to be sure, isn't this a prohibitive nr of keys? I.e.: I can imagine that Aerospike keeps some in-mem entry per key for indexing? If so, the mem requirement for only that aspect would already be far more than my future nodes can handle. Concern? Also, does the format `set_key` have a special meaning in Aerospike, e.g.: telling the sharding code to keep all pairs with the same `set_` prefix together on the same shard? – Geert-Jan Oct 29 '14 at 16:07
  • 3
    This is not a prohibitive number for keys as aerospike can support upto 2^160 keys. Aerospike takes about 64 bytes of memory per index entry. So, the RAM requirement just for index is about 64GB which is not too bad if you put 3-4 nodes.But this is assuming the worst case i.e. 100k * 10k. If they are less in reality, the need will be less accordingly. The _ has not no meaning for aerospike. for aerospike, this is a normal key. The sharding is based on teh key. So, keys of a set will not be together. You should not aim for that too unless you have a different reason. – sunil Oct 29 '14 at 16:35
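
The back-of-the-envelope arithmetic in that comment can be checked directly (the 64 bytes/entry figure is the one quoted in the comment; the variable names are illustrative):

```python
# Worst-case primary-index RAM for the "one Aerospike key per
# set-value pair" model, using the ~64 bytes/entry figure quoted
# in the comment above.
num_sets = 100_000
values_per_set = 10_000
bytes_per_index_entry = 64

total_keys = num_sets * values_per_set            # 1,000,000,000 keys
index_ram_bytes = total_keys * bytes_per_index_entry
index_ram_gb = index_ram_bytes / 10**9            # ~64 GB of index RAM
```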