I was doing some experiments to understand Riak. Here's something interesting I found.

I have a cluster of 2 nodes and a bucket type with an n_val of 2:

[root@co-riak002 ~]# riak-admin ring-status
================================== Claimant ===================================
Claimant:  'riak@10.172.48.68'
Status:     up
Ring Ready: true

============================== Ownership Handoff ==============================
No pending changes.

============================== Unreachable Nodes ==============================
All nodes are up and reachable

[root@co-riak002 ~]# riak-admin bucket-type create testBucket '{"props":{"n_val":2}}'
testBucket created
[root@co-riak002 ~]# riak-admin bucket-type activate testBucket                        
testBucket has been activated

And then I wrote something into it:

[root@co-riak002 ~]# curl -XPUT -d '{"bar":"foo"}' -H "Content-Type: application/json" http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?w=2&returnbody=true
[1] 10890
[root@co-riak002 ~]# 
[1]+  Done                    curl -XPUT -d '{"bar":"foo"}' -H "Content-Type: application/json" http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?w=2

Now I can read it fine with both r=2 and pr=2:

[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=2
{"bar":"foo"}
[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?pr=2
{"bar":"foo"}

After I killed one of the nodes, r=2 still reads fine, but pr=2 does not:

[root@co-riak002 ~]# riak-admin ring-status
================================== Claimant ===================================
Claimant:  'riak@10.172.48.68'
Status:     up
Ring Ready: true

============================== Ownership Handoff ==============================
No pending changes.

============================== Unreachable Nodes ==============================
The following nodes are unreachable: ['riak@10.172.48.66']

With r=2:

[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=2
{"bar":"foo"}

With pr=2:

[root@co-riak002 ~]# curl http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?pr=2
PR-value unsatisfied: 1/2

I am confused. Shouldn't the quorum number r used in a read operation mean the number of replicas/physical nodes that need to agree before data is returned? Why is it not enforced in this case, with only one node up? And why is pr enforced in this case, when it should mean the number of vnodes?

I am pretty new to this space. Any pointers would be much appreciated.

saladinxu

2 Answers

You should distinguish between the "sloppy quorum" and "strict quorum".

As you probably know, a hash function is applied to each key to calculate where that key must be located in the Riak cluster. The entire space of hash values is called a "ring", and it is divided equally between vnodes (virtual nodes), which in turn are assigned to physical nodes. The assignment is done in such a way as to ensure that adjacent vnodes belong to distinct physical nodes for reliability, although that is not always possible.

If replication is turned on (i.e. n_val > 1), a key is written not only to its destination vnode, but also to the vnodes that follow it on the ring, n_val vnodes in total (on different physical nodes in most cases, see above). Those are the primary vnodes for that key.

However, with a sloppy quorum (for instance, W = 2), if a primary vnode is not available, replicas of the key will be written to any available vnode, potentially even on the same physical node. That's OK, because they will be handed off to the "right" vnodes as soon as the problem is fixed and the primary vnodes become available.

If you don't want to risk replicas being written to the same physical node even temporarily, or want to make sure the client receives the most up-to-date values, you can explicitly require all or at least some writes to be made only to primary vnodes (PW = 2; "P" stands for "primary"). This comes at the expense of high availability, though. The same logic works for reads (R and PR).
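
The difference between the two kinds of quorum can be illustrated with a toy model. This is only a sketch, not Riak's actual implementation: the `Vnode`, `preflist` and `quorum_met` names are made up for illustration, the ring is reduced to a list of partition owners, and Riak's real claim and fallback selection logic is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vnode:
    partition: int  # ring partition this vnode serves
    node: str       # physical node currently hosting it
    primary: bool   # False when it is a fallback standing in for a primary


def preflist(first_partition, owners, n_val, up_nodes):
    """Pick n_val vnodes for a key (hypothetical helper).

    owners[i] is the physical node that owns ring partition i.
    Unreachable primaries are replaced by fallbacks on reachable nodes.
    """
    size = len(owners)
    vnodes = []
    for i in range(n_val):
        p = (first_partition + i) % size
        if owners[p] in up_nodes:
            vnodes.append(Vnode(p, owners[p], primary=True))
            continue
        # Primary is down: walk the ring until a reachable host is found.
        for step in range(1, size):
            host = owners[(p + step) % size]
            if host in up_nodes:
                vnodes.append(Vnode(p, host, primary=False))
                break
    return vnodes


def quorum_met(vnodes, w=0, pw=0):
    """Sloppy quorum (w) counts any vnode; strict quorum (pw) counts primaries only."""
    primaries = sum(v.primary for v in vnodes)
    return len(vnodes) >= w and primaries >= pw


# Two physical nodes, eight partitions, ownership alternating A, B, A, B, ...
owners = ["A", "B"] * 4
one_node_down = preflist(3, owners, n_val=2, up_nodes={"A"})

print(quorum_met(one_node_down, w=2))   # True: the fallback counts
print(quorum_met(one_node_down, pw=2))  # False: only 1 of 2 is a primary
```

With node B down, both vnodes in the list end up on node A, which is exactly the "same physical node" situation described above. The same reasoning applies to R and PR on reads.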

Hope this helps.

I strongly recommend reading "A Little Riak Book". The online documentation is also excellent.

vempo

shouldn't the Quorum number r used in reading operation mean the number of replicas/physical nodes that need to agree before returning data?

Not exactly. The read quorum (r) is the number of vnodes that must provide an acceptable response. When you read with one node down, the rest of the cluster (in this case the remaining node) will start up fallbacks for any missing vnodes as needed.

When your read request with r=2 arrives, since one vnode in the preflist is unavailable, a fallback is started up. Naturally, that fallback is empty when first started, so the read process receives notfound from the fallback and the stored object from the other.

The trick here is the notfound_ok setting in the bucket properties or request options. If left at the default of notfound_ok=true, the notfound is considered a valid response, so the operation meets the quorum, the response with data trumps the notfound, and the client gets back an object. This also triggers read repair, which populates the fallback with that object, so the next get request will receive 2 objects and no notfound responses.

If notfound_ok is false, the first read request will see only 1 valid response and fail, but read repair still happens so the next r=2 request succeeds because the fallback also has the data.
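
The behaviour described above can be sketched as a small simulation. It is a simplified model under stated assumptions, not Riak's get logic: each vnode is just a dict, the `get` function and its status strings are hypothetical names, and conflict resolution between differing replicas is ignored (the first value found wins).

```python
NOTFOUND = object()  # marker for a vnode that holds no object for the key


def get(stores, key, r, pr=0, notfound_ok=True):
    """Simulate a read over the vnodes in a preflist (hypothetical helper).

    stores is a list of (is_primary, data) pairs, one per vnode.
    Returns a (status, value) tuple; the status names are made up.
    """
    primaries = sum(1 for is_primary, _ in stores if is_primary)
    if primaries < pr:
        return ("pr_unsatisfied", None)

    replies = [data.get(key, NOTFOUND) for _, data in stores]
    found = [value for value in replies if value is not NOTFOUND]
    # With notfound_ok=True, a notfound counts towards r; otherwise it does not.
    valid = len(replies) if notfound_ok else len(found)

    # Read repair: copy the object to every vnode that answered notfound.
    if found:
        for _, data in stores:
            data.setdefault(key, found[0])

    if valid < r:
        return ("r_unsatisfied", None)
    if not found:
        return ("notfound", None)
    return ("ok", found[0])


# One primary holding the object, one freshly started (empty) fallback.
stores = [(True, {"hello": '{"bar":"foo"}'}), (False, {})]

print(get(stores, "hello", r=2, pr=2))                 # PR fails: 1 primary
print(get(stores, "hello", r=2, notfound_ok=False))    # first read fails
print(get(stores, "hello", r=2, notfound_ok=False))    # succeeds after repair
```

The second and third calls show the effect of read repair: the first strict read fails, but it leaves the fallback populated, so the identical request succeeds afterwards.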

It is a valid tactic to use r=1, notfound_ok=false for reads to get high availability and the fastest possible response while keeping reasonable reassurances that you won't get false notfound responses when a node fails.
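
As a rough illustration, such a request could be built as below. The `fetch_url` helper is a hypothetical name; the host, bucket type, bucket and key are the ones from the question, and passing `notfound_ok` as a query parameter is assumed to work the same way as `r` and `pr` do there.

```python
from urllib.parse import quote, urlencode


def fetch_url(base, bucket_type, bucket, key, **params):
    """Build a fetch URL with per-request quorum options (hypothetical helper)."""
    path = "/types/{}/buckets/{}/keys/{}".format(
        quote(bucket_type, safe=""), quote(bucket, safe=""), quote(key, safe="")
    )
    query = urlencode(params)
    return base + path + ("?" + query if query else "")


url = fetch_url(
    "http://localhost:8098", "testBucket", "stuff", "hello",
    r=1, notfound_ok="false",
)
print(url)
# http://localhost:8098/types/testBucket/buckets/stuff/keys/hello?r=1&notfound_ok=false
```

When a URL like this is passed to curl from a shell, it should be quoted, because an unquoted `&` is treated by the shell as the background operator and everything after it is dropped from the request.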

Joe