
This is the content of my RDD, which I am saving to a Cassandra table. But it looks like the second row is written first and then the first row overwrites it, so I end up with bad output.

(494bce4f393b474980290b8d1b6ebef9, 2017-02-01, PT0H9M30S, WEDNESDAY)
(494bce4f393b474980290b8d1b6ebef9, 2017-02-01, PT0H10M0S, WEDNESDAY)

Is there a way to force the order of the rows written to Cassandra? Please help. Thanks.

shylas

3 Answers


Is there an order to SaveToCassandra?

Within a single task, execution is deterministic, but that may not be the ordering you are expecting. There are two things to think about here.

  1. RDDs are made of Spark Partitions, and the order of execution for these partitions depends on system conditions. Different numbers of cores, heterogeneous machines, or executor failures could all change the execution order. Two Spark Partitions with data for the same Cassandra Partition could therefore be executed in any order.
  2. For each Spark partition, records are batched in the order in which they are received, but this does not necessarily mean they will be sent to Cassandra in that order. There are settings in the connector that determine when a batch is sent (sketched below), and it is conceivable that a batch containing later data will be executed before a batch with earlier data. So the order in which batches are sent is deterministic, but not necessarily the same as the order of the underlying iterator.
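For reference, here is a hedged sketch of the connector properties that govern that batching. The property names come from the Spark Cassandra Connector reference documentation (1.6/2.0 era); the host and the values are placeholders, so check the defaults for the connector version you actually run.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: these settings decide how rows are grouped
    // into batches and when a batch is flushed to Cassandra.
    val conf = new SparkConf()
      .setAppName("cassandra-batching-example")
      .set("spark.cassandra.connection.host", "127.0.0.1")              // placeholder contact point
      .set("spark.cassandra.output.batch.grouping.key", "partition")    // group rows by Cassandra partition
      .set("spark.cassandra.output.batch.size.rows", "64")              // flush a batch after 64 rows
      .set("spark.cassandra.output.batch.grouping.buffer.size", "1000") // batches held open per task
      .set("spark.cassandra.output.concurrent.writes", "5")             // batches in flight at once

    val sc = new SparkContext(conf)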

Does this matter for your application?

Probably not. This should only really matter if your data is spread widely across the RDD. If entries for a particular Cassandra Partition are spread amongst multiple Spark Partitions, then the order of Spark execution could mess up your upsert. Consider:

Spark Partition 1 has Record A
Spark Partition 2 has Record B

Both Spark Partitions start work simultaneously, but Record B is
reached before Record A.

But I think this is unlikely the issue.

The issue you are running into is most likely the common one: the order of statements in my batch is not respected. The core of this issue is that all statements within a Cassandra batch are executed "simultaneously," meaning that if there are conflicts for any primary key there needs to be conflict resolution. In these cases Cassandra chooses the greater cell value for all conflicts. Since the connector automatically batches together writes to the same partition key, you can end up with conflicts.

You can see this in your example: the larger value (PT0H9M30S) is kept and the smaller (PT0H10M0S) is discarded. As strings, "PT0H9M30S" compares greater than "PT0H10M0S", even though it represents a shorter duration. The problem isn't really the order, but the fact that the batching is occurring.
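To make the conflict concrete, here is a hedged reproduction outside of Spark, using the DataStax Java driver from Scala (3.x driver API assumed; the keyspace ks and table events are made up). Both INSERTs travel in one batch, so they share a write timestamp and Cassandra breaks the tie by keeping the greater cell value, regardless of statement order:

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("ks") // hypothetical keyspace

    // Both inserts hit the same primary key, so they conflict inside the
    // batch and the greater string value of `period` wins.
    session.execute(
      """BEGIN UNLOGGED BATCH
        |  INSERT INTO events (userid, date, period, day)
        |  VALUES ('494bce4f393b474980290b8d1b6ebef9', '2017-02-01', 'PT0H9M30S', 'WEDNESDAY');
        |  INSERT INTO events (userid, date, period, day)
        |  VALUES ('494bce4f393b474980290b8d1b6ebef9', '2017-02-01', 'PT0H10M0S', 'WEDNESDAY');
        |APPLY BATCH""".stripMargin)

    // Reading the row back returns period = 'PT0H9M30S',
    // no matter which insert came first.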

How can I do upserts based on time then?

Very carefully. There are a few approaches I would consider taking.

The best option would be to not do upserts based on time. If you have multiple entries for a primary key but only want the last one, do the reduction in Spark prior to hitting Cassandra. Removing your unwanted entries before you try to write will save time and load on your Cassandra cluster. Otherwise you are using Cassandra as a rather expensive de-duping machine.
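A minimal sketch of that reduction, assuming a hypothetical Event case class and keyspace/table names ks.events; the pickWinner rule is the part you must adapt to your own definition of which record matters:

    import org.apache.spark.SparkContext
    import com.datastax.spark.connector._ // adds saveToCassandra to RDDs

    case class Event(userid: String, date: String, period: String, day: String)

    // Placeholder rule: RDD order is not a safe tiebreaker in a distributed
    // reduce, so compare an actual field. Here we parse the ISO-8601 period
    // and keep the longer one, so PT0H10M0S beats PT0H9M30S.
    def pickWinner(a: Event, b: Event): Event =
      if (java.time.Duration.parse(a.period)
            .compareTo(java.time.Duration.parse(b.period)) >= 0) a
      else b

    // sc is an existing SparkContext configured for Cassandra.
    def dedupeAndSave(sc: SparkContext): Unit = {
      val rdd = sc.parallelize(Seq(
        Event("494bce4f393b474980290b8d1b6ebef9", "2017-02-01", "PT0H9M30S", "WEDNESDAY"),
        Event("494bce4f393b474980290b8d1b6ebef9", "2017-02-01", "PT0H10M0S", "WEDNESDAY")
      ))

      rdd
        .keyBy(e => (e.userid, e.date))  // key by the Cassandra primary key
        .reduceByKey(pickWinner)         // exactly one record per key survives
        .values
        .saveToCassandra("ks", "events") // hypothetical keyspace and table
    }

Comparing parsed durations rather than raw strings is what lets the longer period win here; whether "longer" is the right rule is an assumption about your data.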

A much worse option would be to simply disable batching in the Spark Cassandra Connector. This will hurt performance, but will fix the issue if you only care about the order within Spark Partitions. It will still allow conflicts if you have multiple Spark Partitions, because you cannot control their order of execution.
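If you do take that route, here is a sketch under the same assumptions as the previous snippet (verify WriteConf against your connector version):

    import org.apache.spark.rdd.RDD
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{RowsInBatch, WriteConf}

    // RowsInBatch(1) writes every row as its own statement instead of
    // batching, preserving order within a Spark partition at the cost
    // of throughput.
    def saveUnbatched(rdd: RDD[Event]): Unit =
      rdd.saveToCassandra("ks", "events",
        writeConf = WriteConf(batchSize = RowsInBatch(1)))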

The Moral of this Story

State is bad. Order is bad. Design your system to be idempotent if at all possible. If there are multiple records and you know which ones matter, remove the ones that don't before you get to a distributed last-write-wins (LWW) system.

RussS

This all depends on the definition of the table that you make. Ordering by partition key (the first part of the primary key) is not guaranteed.

The rest of the primary key is used to sort the rows within the partition. This is where your problem comes from: you have to define clustering columns.

It is described here: https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html

The ordering of the inserts still matters, but only in the sense that if there are two writes for the same primary key, the last one wins. I don't think that is the case here.

Also, you might consider putting the information that you have in "PT0H9M30S" under a clustering column so that you keep your data and don't overwrite it, as in the sketch below.
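A sketch of such a table (all names are made up, and the driver call is just one way to run the DDL): with period as a clustering column, each period value becomes its own row under the (userid, date) partition instead of overwriting the previous one.

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // (userid, date) is the partition key; period is a clustering column.
    session.execute(
      """CREATE TABLE IF NOT EXISTS ks.events_by_period (
        |  userid text,
        |  date   date,
        |  period text,
        |  day    text,
        |  PRIMARY KEY ((userid, date), period)
        |) WITH CLUSTERING ORDER BY (period ASC)""".stripMargin)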

Marko Švaljek
  • I do want the upsert. The first column is the user and the third is the period, and my key is (userid, date). For a given user and date combination I want to see only one row, so I cannot add the period to the key. But my problem is that PT0H10M0S is overwritten by PT0H9M30S, even though the order of the rows in the RDD is PT0H9M30S and then PT0H10M0S. Appreciate your input. Thanks. – shylas Feb 03 '17 at 17:59

Cassandra is a time-series database. You should design your table so that no overwrite occurs. Or, if you want to keep only the earliest/latest timestamp, reduce your RDD using a transformation like reduceByKey to retain only the earliest/latest timestamp for a particular key.

Knight71