
I'm planning a few ETLs that will eventually "fill in" the same row in Cassandra. For example, if a table is defined as:

CREATE TABLE MyTable (
  key text,
  column1 text,
  column2 text,
  column3 text,
  column4 text, 
  PRIMARY KEY (key)
)

Then several ETLs will fill in the appropriate values in columns 1-4 at different times.

How well does Cassandra handle such operations? Should I read the row first, update it in code, and then write it back, or will an UPDATE call do the trick?

I know that Cassandra is highly optimized for write throughput in that it never modifies data on disk; it only appends to existing files or creates new ones. Knowing that, and without diving deeper into the implementation, it worries me that if one ETL writes column4 and 20 minutes later a different ETL writes column2, I will lose a lot of performance compared to waiting for all the ETLs to finish and then saving all the data in bulk (which is not an easy implementation by itself).

Ideas?

Erick Ramirez
idoda

1 Answer


All inserts and updates in Cassandra are upserts, and Cassandra uses last-write-wins for conflict resolution. If your ETLs update different columns, there is no issue. If they update the same column, the last write for that column wins. If that is a problem, you can add a timestamp column as a clustering key (allowing multiple versions of the data) and, on read, fetch the latest one. You could also add a TTL so that older, irrelevant versions get cleared out.
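The timestamp-clustering-key idea above could be sketched like this in CQL (the versioned table name, the `updated_at` column, and the one-day TTL are illustrative assumptions, not part of the question's schema):

```sql
-- Hypothetical variant of MyTable that keeps one row per write:
-- updated_at is part of the primary key, so writes at different
-- times land in different rows instead of overwriting each other.
CREATE TABLE MyTableVersioned (
  key text,
  updated_at timestamp,
  column1 text,
  column2 text,
  column3 text,
  column4 text,
  PRIMARY KEY (key, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);

-- Write with a TTL so stale versions expire automatically (86400 s = 1 day).
INSERT INTO MyTableVersioned (key, updated_at, column1)
VALUES ('row1', toTimestamp(now()), 'value from ETL1')
USING TTL 86400;

-- Because of the DESC clustering order, the newest version comes first.
SELECT * FROM MyTableVersioned WHERE key = 'row1' LIMIT 1;
```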

If certain columns are updated and others aren't, you'll effectively get null for the untouched columns when querying.
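Against the original schema, this means the ETLs can simply issue independent UPDATEs; each one is an upsert, and no read-before-write is needed. A minimal sketch (the key and values are made up):

```sql
-- ETL1 writes only column4; this creates the row if it doesn't exist yet.
UPDATE MyTable SET column4 = 'd' WHERE key = 'row1';

-- 20 minutes later, ETL2 writes only column2 on the same row.
UPDATE MyTable SET column2 = 'b' WHERE key = 'row1';

-- column1 and column3 come back as null until some ETL writes them.
SELECT column1, column2, column3, column4 FROM MyTable WHERE key = 'row1';
```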

I couldn't really understand your last paragraph. Could you please explain your concern?

ashic
  • Imagine ETL1 updated column1. Ten minutes later ETL2 updated column2, and so on. If we assume that each partition is a file on disk, and that the first update does not reserve space for the rest of the columns (because it is only concerned with column1, and the columns are text, so the sizes of later updates are unknown), then later updates might not fit in place and the file might need to be copied to a different location. – idoda Aug 20 '15 at 13:51
  • Cassandra will take care of that. Writes go to the commit log on disk and into memory first; SSTable storage is handled later. As such, your use case won't run into issues. It's a fully supported scenario. – ashic Aug 21 '15 at 04:24
  • Thanks a lot. Can you link me to documentation about that or a similar scenario? – idoda Aug 23 '15 at 15:30