When executing an LWT such as INSERT ... IF NOT EXISTS, the Java driver may throw a CASWriteUnknownException. It's unclear (a) which phase failed, (b) whether the LWT failed completely or partially succeeded and caused a mutation, and (c) how the client should handle this exception.

Cassandra's Paxos has four phases:

  • Prepare/Promise
  • Read/Results
  • Propose/Accept
  • Commit/Acknowledge

These are implemented as follows:

| Phase | Implementation method | Failure exceptions |
|---|---|---|
| Prepare/Promise | StorageProxy.beginAndRepairPaxos() | WriteTimeoutException, WriteFailureException |
| Read/Results | ? | ? |
| Propose/Accept | StorageProxy.proposePaxos() | WriteTimeoutException, CasWriteUnknownResultException |
| Commit/Acknowledge | StorageProxy.commitPaxos() | WriteTimeoutException |

The Java Driver 4.x documentation is silent on CASWriteUnknownException. The driver source has a comment on the exception class, but it is terse and only restates the class name in other words:

/**
 * The result of a CAS operation is in an unknown state.
 *
 *...
 */
public class CASWriteUnknownException...

The Native Protocol v5 Spec has a bit more information but is also not very clear on the state of the transaction, and raises other questions about what may or may not be completed:

0x1700 CAS_WRITE_UNKNOWN: An exception occured due to contended Compare And Set write/update. The CAS operation was only partially completed and the operation may or may not get completed by the contending CAS write or SERIAL/LOCAL_SERIAL read.
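
Read literally, the spec text suggests that a later SERIAL or LOCAL_SERIAL read can settle the in-flight operation. Below is a minimal sketch of a client reacting that way. It deliberately uses hypothetical stand-in types (`UnknownCasResult`, the two suppliers) instead of the real driver API, and it assumes each writer inserts a value it can recognise as its own; neither assumption comes from the driver documentation.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Sketch only: resolves an "unknown" CAS outcome with a follow-up serial read. */
final class UnknownCasResultSketch {

    /** Hypothetical stand-in for the driver's CASWriteUnknownException. */
    static final class UnknownCasResult extends RuntimeException {}

    enum Result { APPLIED, NOT_APPLIED }

    /**
     * @param myValue           value this writer tried to insert (assumed unique per writer)
     * @param conditionalInsert hypothetical callback running INSERT ... IF NOT EXISTS;
     *                          returns true when the statement was applied
     * @param serialRead        hypothetical callback reading the row at SERIAL consistency
     */
    static Result insertIfNotExists(String myValue,
                                    BooleanSupplier conditionalInsert,
                                    Supplier<Optional<String>> serialRead) {
        try {
            return conditionalInsert.getAsBoolean() ? Result.APPLIED : Result.NOT_APPLIED;
        } catch (UnknownCasResult e) {
            // Outcome unknown: re-read and check whether our own value is what got stored.
            Optional<String> stored = serialRead.get();
            boolean ours = stored.isPresent() && stored.get().equals(myValue);
            return ours ? Result.APPLIED : Result.NOT_APPLIED;
        }
    }

    public static void main(String[] args) {
        // Simulate an unknown outcome followed by a read that finds our value.
        Result r = insertIfNotExists("writer-1",
                () -> { throw new UnknownCasResult(); },
                () -> Optional.of("writer-1"));
        System.out.println(r);
    }
}
```

The read-back only distinguishes "my write" from "someone else's write" if the inserted value identifies the writer; with identical values from several clients the answer stays ambiguous.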

Finally, looking at the Apache Cassandra server source code, the exception appears to be thrown only during the third (propose) phase of Paxos:

    /**
     * Propose the {@param proposal} accoding to the {@param replicaPlan}.
     * When {@param backoffIfPartial} is true, the proposer backs off when seeing the proposal being accepted by some but not a quorum.
     * The result of the cooresponding CAS in uncertain as the accepted proposal may or may not be spread to other nodes in later rounds.
     */

    private static boolean proposePaxos(Commit proposal, ReplicaPlan.ForPaxosWrite replicaPlan, boolean backoffIfPartial, long queryStartNanoTime)
    throws WriteTimeoutException, CasWriteUnknownResultException
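
To make the "accepted by some but not a quorum" wording in that Javadoc concrete, here is a simplified model of how the replies to one propose round could be classified. It is a sketch under my own assumptions (majority quorum, all replicas have answered, hypothetical names `ProposeOutcomeSketch`, `Outcome`, `classify`), not the logic of `proposePaxos()` itself.

```java
/** Simplified, hypothetical model of classifying the replies to one Paxos propose round. */
final class ProposeOutcomeSketch {

    enum Outcome {
        ACCEPTED,      // a quorum accepted: safe to move on to commit
        NOT_ACCEPTED,  // no replica accepted: nothing can be carried forward
        UNKNOWN        // some, but fewer than a quorum, accepted: result is uncertain
    }

    /** Majority quorum for the given number of replicas. */
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    /** Classifies a round in which {@code accepts} of {@code replicas} accepted the proposal. */
    static Outcome classify(int replicas, int accepts) {
        if (replicas <= 0 || accepts < 0 || accepts > replicas) {
            throw new IllegalArgumentException("invalid replica/accept counts");
        }
        if (accepts >= quorum(replicas)) {
            return Outcome.ACCEPTED;
        }
        // A single acceptor is enough for a later round to pick the value up again.
        return accepts == 0 ? Outcome.NOT_ACCEPTED : Outcome.UNKNOWN;
    }

    public static void main(String[] args) {
        int replicas = 3;
        for (int accepts = 0; accepts <= replicas; accepts++) {
            System.out.println(accepts + "/" + replicas + " accepts: " + classify(replicas, accepts));
        }
    }
}
```

In this model the uncertain case is exactly the one where at least one replica holds the accepted proposal while the proposer has no quorum.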

If that is the case, that is, if the exception occurs before the commit phase, is there any need to worry that a partial commit happened? WriteTimeoutException is the only other exception thrown by proposePaxos().

Brad Schoening

1 Answer

I agree with your analysis. Since the exception is thrown during the propose phase, I don't see a risk of commits being applied, because the operation never reaches the commit phase.

It looks to me that for the propose phase to fail with CasWriteUnknownResultException, the replicas would have to be overloaded or have experienced some server failure, so that they couldn't accept the proposal in time.

Without acceptance from a quorum of replicas, there shouldn't be a commit. Cheers!

Erick Ramirez
  • Thanks, Erick. It looks like https://issues.apache.org/jira/browse/CASSANDRA-15350 introduced the CasWriteUnknownResultException in 4.0. There is a lot to unpack in the Jira text and dialog, but it states: "One of this conditions it manifests is when there’s at least one acceptor that has accepted the value, which means that this value may still get accepted during the later round, despite the proposer failure." – Brad Schoening Aug 22 '23 at 17:08