We're seeing some odd behaviour with a cleanup cronjob and Riak:

The objects we store (postboxes) have a 2i on their modification date (a Unix timestamp). A cronjob runs frequently and deletes all postboxes that have not been modified within 180 days. However, we've found evidence that a very small number of postboxes that were modified within the last three days were deleted by this cronjob. After reviewing and debugging every line of code several times over, I am confident that this is not a problem with the cronjob.

  • I also traced back all delete calls to that bucket, and no one else is deleting objects there.
  • Of course I also tried reading the postboxes from Riak with r=ALL: they're definitely gone. (They are stored with w=QUORUM.)
  • I also checked the logs: updating the postboxes did succeed (no errors were reported back from the write operations).

This leaves me with two possible causes for this:

  • Riak loses data (which I am not willing to believe that easily)
  • the secondary indexes are corrupt and queries against them return wrong keys

So my questions are:

  • Can 2is actually break?
  • Is it possible to verify that?
  • Am I missing something completely different?

Cheers, Matthias


1 Answer

Secondary index queries in Riak are coverage queries, which means that they will only use one of the stored replicas, and not perform a quorum read.

As you are writing with w=QUORUM, the operation is deemed successful as soon as a quorum of replicas has acknowledged it, so with n_val set to 3 or higher it is possible that one (or more) of the replicas does not get updated. If that stale replica is the one selected for the coverage query, you could end up deleting based on the old value. In order to avoid this, you will need to perform updates with w=ALL.
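
The failure mode described above can be illustrated with a small in-memory model. This is only a sketch: it does not talk to Riak, and the names `make_replicas`, `write` and `coverage_query`, the key `postbox-1` and the fixed `now` value are all made up for the illustration. It assumes n_val = 3 and a write that reaches only two of the three replicas.

```python
DAY = 86400
N_VAL = 3  # assumed replication factor


def make_replicas(key, modified):
    # Each replica keeps its own copy of the object's timestamp (its own 2i entry).
    return [{key: modified} for _ in range(N_VAL)]


def write(replicas, key, modified, reachable):
    """Apply the write to the reachable replicas and return the number of acks."""
    for i in reachable:
        replicas[i][key] = modified
    return len(reachable)


def coverage_query(replicas, replica_index, cutoff):
    """Toy 2i range query answered by one replica only (no quorum read)."""
    return [k for k, ts in replicas[replica_index].items() if ts < cutoff]


now = 1406000000            # arbitrary fixed "current time" for the example
cutoff = now - 180 * DAY    # postboxes older than this are cleanup candidates

# The postbox was last modified 200 days ago on all three replicas.
replicas = make_replicas("postbox-1", now - 200 * DAY)

# An update one day ago reaches replicas 0 and 1 but not replica 2.
acks = write(replicas, "postbox-1", now - 1 * DAY, reachable=[0, 1])
quorum = N_VAL // 2 + 1
write_ok = acks >= quorum   # the write counts as successful with w=QUORUM

# The index query result depends on which replica answers it.
fresh_hits = coverage_query(replicas, 0, cutoff)
stale_hits = coverage_query(replicas, 2, cutoff)
```

In this model the write is reported as successful, yet a query answered by replica 2 still lists the postbox as older than 180 days, which is exactly the situation in which the cronjob would delete it.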

Christian Dahlqvist
  • Hi Christian, in our case, however, this would mean that an object was not transferred to all nodes within up to 181 days, and that timeframe sounds like it should be enough to be fairly sure that everything has been replicated (especially with AAE on, and we're writing with QUORUM). Am I making incorrect assumptions here? – Matthias Jul 23 '14 at 11:25
  • The update performed within the last 3 days, which contained the new timestamp, may not have reached all partitions that hold a copy of the original value. This could cause the old timestamp to be found, leading to the object being deleted. – Christian Dahlqvist Jul 23 '14 at 12:05
  • Another thing that could cause the issue you are seeing is a bug in AAE that was fixed in the Riak 1.4.8 release. This bug made AAE stop detecting object modifications, causing it to not repair them (https://github.com/basho/riak/blob/riak-1.4.8/RELEASE-NOTES.md). I am not sure exactly when this bug was introduced, but I believe it was somewhere around 1.4.3 or so. – Christian Dahlqvist Jul 24 '14 at 07:51
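
One way to guard against a stale index entry, which is not proposed in the thread itself and is offered here only as a hedged sketch, is to re-read each candidate from a majority of replicas and re-check its timestamp before deleting. The model below is again purely in-memory; `newest_timestamp`, `confirm_expired` and the postbox keys are hypothetical names, and it assumes that a majority read overlaps with the replicas that acknowledged the quorum write.

```python
DAY = 86400


def newest_timestamp(replica_subset, key):
    """Return the newest timestamp for key among the given replicas, or None."""
    stamps = [r[key] for r in replica_subset if key in r]
    return max(stamps) if stamps else None


def confirm_expired(candidates, replica_subset, cutoff):
    """Keep only the candidates whose newest known timestamp is before cutoff."""
    confirmed = []
    for key in candidates:
        ts = newest_timestamp(replica_subset, key)
        if ts is not None and ts < cutoff:
            confirmed.append(key)
    return confirmed


now = 1406000000            # arbitrary fixed "current time" for the example
cutoff = now - 180 * DAY

replicas = [
    {"postbox-1": now - DAY, "postbox-2": now - 200 * DAY},
    {"postbox-1": now - DAY, "postbox-2": now - 200 * DAY},
    # This replica missed the recent update to postbox-1.
    {"postbox-1": now - 200 * DAY, "postbox-2": now - 200 * DAY},
]

# The index query happens to be answered by the stale replica.
candidates = [k for k, ts in replicas[2].items() if ts < cutoff]

# Re-check against a majority (two of three) that includes the stale replica.
to_delete = confirm_expired(candidates, replicas[1:], cutoff)
```

Here the stale replica nominates both postboxes, but the re-check drops `postbox-1` because one replica in the majority holds the newer timestamp. Whether a real quorum read gives the same guarantee depends on the cluster's settings, so treat this as an illustration of the reasoning rather than a drop-in fix.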