
I just read the DataStax post "Basic Rules of Cassandra Data Modeling" and, to sum up, we should model our database schema around our queries and not around our relations/objects. So many tables can contain the same duplicated data, for example users_by_email and users_by_username, which both hold the same data.

How can I handle object updates?
For example, if a user edits their email, do I UPDATE both tables manually, or only INSERT the object with all columns and not care about the previous data (which is still in my database, but now with a stale email value)?
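For reference, the two denormalized tables from the linked article look roughly like this (a sketch; column names follow the post and the answers below, adjust to your actual schema):

```sql
-- Same user data, keyed two different ways for two different queries.
CREATE TABLE users_by_email (
  email    text PRIMARY KEY,
  username text,
  age      int
);

CREATE TABLE users_by_username (
  username text PRIMARY KEY,
  email    text,
  age      int
);
```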

In the case of an UPDATE, how can I handle data synchronization?
Currently I'm doing it manually, but is there a tool to help me? Potentially I can have 5 or 6 tables with different partition/clustering keys.
I heard that Hadoop or Apache Spark can do this.

Divi

3 Answers


To ensure data consistency across your many tables containing the same data, but laid out differently, it's recommended that you use a LOGGED BATCH in CQL to do the update. This way the CQL statements (updating the data) in your BATCH are applied atomically - either all of them eventually succeed or none do - and you don't have to worry about some failing while others succeed. (Note that logged batches guarantee atomicity, not full ACID semantics: there is no isolation, so a concurrent read may briefly see one write before the other.)

Using the linked article's schema it would look like:

BEGIN BATCH
  INSERT INTO users_by_email (email, username, age) VALUES ('fromanator@email.com', 'fromanator', 24);
  INSERT INTO users_by_username (email, username, age) VALUES ('fromanator@email.com', 'fromanator', 24);
APPLY BATCH;

This whole statement is atomic: if one insert fails, they all fail and no change is made.
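For the email-change case from the question there is one subtlety: in users_by_email the email is the partition key, so it cannot be changed in place with an UPDATE - you have to delete the old row and insert a new one, all inside the same logged batch. A sketch, assuming the schema above (the email values are illustrative):

```sql
BEGIN BATCH
  -- email is the partition key here: remove the old row, write a new one
  DELETE FROM users_by_email WHERE email = 'fromanator@email.com';
  INSERT INTO users_by_email (email, username, age)
    VALUES ('new@email.com', 'fromanator', 24);
  -- email is a regular column here, so a plain UPDATE works
  UPDATE users_by_username SET email = 'new@email.com'
    WHERE username = 'fromanator';
APPLY BATCH;
```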

fromanator
  • but how do you maintain updates? For table T1 the PK is id and for table T2 the PK is new_id. Say at a given time I only have id and not new_id - how do I update T2? Should I read the full row from T1 to get new_id and use that to update T2? Because this sounds very heavy. – im_bhatman Sep 19 '19 at 22:01

In Cassandra, an insert or update using an existing primary key is an "upsert": the new values are written with a newer timestamp, and reads resolve the most recent value per column, with the superseded values eventually removed during compaction (a tombstone is only created for explicit deletes or null values). There are a few subtleties in the difference between INSERT and UPDATE, such as counters and null values, but those are probably not relevant to the question.
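To illustrate the upsert behaviour, using the users_by_username table from the linked article (values are illustrative):

```sql
INSERT INTO users_by_username (username, email, age)
  VALUES ('fromanator', 'old@email.com', 24);

-- Same primary key: this does not fail, it simply supersedes the
-- previous values with newer timestamps.
INSERT INTO users_by_username (username, email, age)
  VALUES ('fromanator', 'new@email.com', 25);

-- A SELECT now returns the second set of values; the old ones are
-- purged later by compaction.
```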

Up to Cassandra 3.0, the responsibility for keeping several views of the same data in sync lies with the client application. And yes, that means inserting/updating the new data in all the different tables that require it.

Cassandra 3.0 introduced "Materialized Views", which let you maintain a "master" table of the data and several views over it, all managed by Cassandra. It requires careful data modelling so that the primary key of the 'master' table contains the columns needed to build the different views and serve the related queries.
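With that feature, the users_by_email / users_by_username pair could be modelled as one base table plus a view. A sketch (note that every primary-key column of the base table must appear in the view's key, and all view key columns must be restricted with IS NOT NULL):

```sql
CREATE TABLE users_by_username (
  username text PRIMARY KEY,
  email    text,
  age      int
);

-- Cassandra keeps this view in sync with the base table automatically,
-- so the client only ever writes to users_by_username.
CREATE MATERIALIZED VIEW users_by_email AS
  SELECT email, username, age
  FROM users_by_username
  WHERE email IS NOT NULL AND username IS NOT NULL
  PRIMARY KEY (email, username);
```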

One additional note: if you find that your data is highly relational and requires several/many views to make it queryable, maybe Cassandra is not a good fit for the problem space and you should probably consider an RDBMS instead.

To extend the example provided, user information is probably something we would like to keep in a relational DB, while high-volume actions of those users (purchases, clicks, heart-rate samples, ...) could be recorded in Cassandra.

maasg
  • Having two (or more?!) systems to manage your data, like Cassandra and an RDBMS, makes things way more complicated. Plus one or the other could become a bottleneck in your current install. – Alexis Wilke Mar 28 '16 at 21:44
  • Except when Inserting or Updating data with an explicit null value you will not create a tombstone. Instead Cassandra will write the update in a new SSTable, and the "live" data is created during the read path by finding the most recently updated column value across multiple SSTables. Compaction will eventually run to ensure that Cassandra doesn't have to read too many SSTables to get the latest data. – fromanator Mar 28 '16 at 23:50
  • @AlexisWilke There are always pros and cons. If the piece of highly relational data is relatively small, one could afford the extra effort of modelling within the CQL boundaries. In other cases, an organization may already have an RDBMS in place, and instead of trying to migrate everything to Cassandra, one could just complement the architecture with the scalability properties of Cassandra for the data that actually needs it. I've seen several well-formulated Relational-plus-NoSQL architectures. – maasg Mar 31 '16 at 13:26

What I've done in my system is have a unique identifier for each user.

I use one table of email / identifier (and some other data). When a user logs in or uses the system, I use his email to find the identifier, then everything else uses that identifier.

The user can now change their email address; the identifier stays the same, so none of the other tables require an UPDATE for such a change.
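A sketch of this layout (the table and column names here are illustrative, not from the answer):

```sql
-- Lookup table: email -> identifier. This is the only table touched
-- when the user changes their email address.
CREATE TABLE user_ids_by_email (
  email   text PRIMARY KEY,
  user_id bigint
);

-- Everything else is keyed by the stable identifier.
CREATE TABLE users (
  user_id  bigint PRIMARY KEY,
  username text,
  age      int
);
```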

In regard to the old email address, I have not done it all yet, but I plan to have the current email reference the old one (a "link", if you'd like) and after a certain amount of time, maybe 12 months, the old email will get deleted. For those 12 months, the account is blocked (no one else can reuse that account.) This is a good idea for various security reasons.

P.S. For a unique identifier, people use different solutions, such as ZooKeeper; I personally like to use Cassandra with Lamport's bakery algorithm.
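On Cassandra 2.0 or later, a simpler alternative (not the bakery algorithm the answer mentions, just a hedged sketch) is a sequence table driven by lightweight transactions: the conditional update only succeeds if no other client claimed the value first, so the caller re-reads and retries on failure. Table and column names are illustrative:

```sql
CREATE TABLE id_sequence (
  name text PRIMARY KEY,
  next bigint
);

-- Claim the next identifier. If [applied] comes back false, another
-- client won the race: re-read the current value and retry.
UPDATE id_sequence SET next = 124
  WHERE name = 'user_id' IF next = 123;
```

Note that LWTs use Paxos under the hood and are noticeably slower than plain writes, so this suits low-rate ID allocation (e.g. user signup), not hot paths.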

Alexis Wilke
  • For your unique identifier why not use the [timeuuid](https://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html) that Cassandra already provides you with? – fromanator Mar 28 '16 at 23:59
  • First of all, I started with 0.8 and Thrift, so such features were not all there. Second, for a user, where you want to have a URI such as example.com/user/123, using a UUID is **really** ugly. Lastly, although you could argue it probably won't matter much, my identifiers can be 32-bit integers, while a UUID is 128 bits. That's more data to move around. – Alexis Wilke Mar 29 '16 at 00:56