
We have a data model of an article with a lot of properties. Here is our table model:

CREATE TABLE articles (
    organization_id bigint,
    gtin text,
    barcodes text,
    code text,
    brand text,
    season text,
    name text,
    option text,
    style text,
    color text,
    sizes text,
    supplier text,
    category text,
    prices text,
    last_updated timeuuid,
    content_hash uuid,
    markdown boolean,
    PRIMARY KEY (organization_id, gtin)
) WITH COMMENT='Articles';

Here gtin uniquely identifies an article, and we save all articles of an organization in one partition. We have a constraint to update each article only if something has changed. This is important because, when an article changes, we update the last_updated field and external devices know which articles to synchronize, since they know when they synchronized last.

We added one more table for that:

CREATE TABLE articles_by_last_updated (
    organization_id bigint,
    gtin text,
    barcodes text,
    code text,
    brand text,
    season text,
    name text,
    option text,
    style text,
    color text,
    sizes text,
    supplier text,
    category text,
    prices text,
    last_updated timeuuid,
    content_hash uuid,
    markdown boolean,
    PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';

So we can easily return all articles updated after a certain point in time. This table must be cleared of duplicates per gtin, since we import articles each day and the sync is done from mobile devices, so we want to keep the dataset small. (In theory we could save everything in that table and overwrite it with the latest info, but that created large datasets between syncs, so we started deleting from that table; to delete, we needed to know last_updated from the first table.)
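
For example, the sync query a device runs could look like this (a sketch, assuming the device passes the timestamp of its last successful sync):

-- Return all articles updated after the device's last sync time.
SELECT * FROM articles_by_last_updated
WHERE organization_id = <organization_id>
  AND last_updated > maxTimeuuid(<last_sync_date>);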

The problems we are facing right now are:

  1. In order to check whether article fields have been updated, we need to do a read before write (we partially solved that with the content_hash field, which is a hash over all fields, so we read and compare the hash of the incoming article with the value in the DB).
  2. We are deleting and inserting in the second table since we need unique gtins there (we need only the latest change to send to devices, not duplicate articles), which produces an awful lot of tombstones; both steps are sketched after this list.
  3. We have a feature to add: search by many different combinations of fields.
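
To make problems 1 and 2 concrete, here is roughly what one import of a single article looks like today (a sketch; placeholder values in angle brackets are bind parameters from the import job, and most columns are omitted for brevity):

-- Problem 1: read before write - fetch the stored hash and compare it
-- in the application with the hash of the incoming article.
SELECT content_hash, last_updated FROM articles
WHERE organization_id = <organization_id> AND gtin = <gtin>;

-- Only if the hashes differ do we write the new version and bump last_updated.
UPDATE articles
SET barcodes = <barcodes>, prices = <prices>,
    content_hash = <new_content_hash>, last_updated = now()
WHERE organization_id = <organization_id> AND gtin = <gtin>;

-- Problem 2: keep gtins unique in the second table - delete the previous
-- entry (using the old last_updated read above), then insert the new one,
-- which leaves a tombstone behind for every changed article.
DELETE FROM articles_by_last_updated
WHERE organization_id = <organization_id> AND last_updated = <old_last_updated>;

INSERT INTO articles_by_last_updated (organization_id, last_updated, gtin, name, prices)
VALUES (<organization_id>, <new_last_updated>, <gtin>, <name>, <prices>);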

Questions:

  1. Is Cassandra a good choice for this kind of data, or should we move it to some other storage (or even use Elasticsearch and Cassandra in combination, where we can index changes over time and Cassandra holds only the master data per gtin)?
  2. Can the data be modeled better for our use case, to avoid read before write or deletes in the second table?

Update Just to clarify the use case: other devices are syncing with pagination (sending last_sync_date, skip and count), so we need a table with all article information, sorted by last_updated, without duplicates and searchable by last_updated.
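
A sketch of such a page fetch on the Cassandra side (assuming, since CQL cannot skip rows server-side, that the device resumes from the last last_updated value it received rather than a numeric skip):

-- First page after the device's last sync.
SELECT * FROM articles_by_last_updated
WHERE organization_id = <organization_id>
  AND last_updated > maxTimeuuid(<last_sync_date>)
LIMIT <count>;

-- Next page: continue from the last timeuuid seen on the previous page.
SELECT * FROM articles_by_last_updated
WHERE organization_id = <organization_id>
  AND last_updated > <last_seen_timeuuid>
LIMIT <count>;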

Nenad Bozic

1 Answer


If you are using Cassandra 2.1.1 or later, you can use the "not equal" comparison in the IF part of the UPDATE statement (see the CASSANDRA-6839 JIRA issue) to make sure you update data only if something has changed. Your statement would look something like this:

UPDATE articles 
SET 
  barcodes = <barcodes>, 
  ... = <...>, 
  last_updated = <last_updated> 
WHERE 
  organization_id = <organization_id> 
  AND gtin = <gtin> 
IF content_hash != <content_hash>;

For your second table, you don't need to duplicate the entire data from the first table; you can do the following.

Create your table like this:

CREATE TABLE articles_by_last_updated (
    organization_id bigint,
    last_updated timeuuid,
    gtin text,
    PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';

Once you've updated the first table, read the last_updated value for that gtin again. If it is equal to or greater than the last_updated value you passed in, you know the update was successful (by your process or another), so you can now insert that retrieved last_updated value into the second table. You don't need to delete records for this update.

I assume you can build a distinct list of updated gtins on the application side if you poll (using a range query) on a regular basis, which I assume pulls a reasonable amount of data. You can also put a TTL of a few poll cycles on these new records, for example, to remove the need for manual deletes.

Then, once you have found the affected gtins, run a second query that pulls all of the data from the first table. You can also run a second sanity check on the updated dates to avoid sending anything that is supposed to be sent on the next update (if that is necessary, of course).
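
In CQL, the flow described above could look roughly like this (a sketch; the 7-day TTL stands in for "a few poll cycles" and should be tuned to your polling interval):

-- After the conditional update, re-read last_updated for that gtin.
SELECT last_updated FROM articles
WHERE organization_id = <organization_id> AND gtin = <gtin>;

-- If it is equal to or greater than the value you passed in, record the change
-- marker and let it expire on its own instead of deleting it manually.
INSERT INTO articles_by_last_updated (organization_id, last_updated, gtin)
VALUES (<organization_id>, <last_updated>, <gtin>)
USING TTL 604800;

-- Polling side: find the gtins changed since the last poll ...
SELECT gtin, last_updated FROM articles_by_last_updated
WHERE organization_id = <organization_id>
  AND last_updated > maxTimeuuid(<last_poll_time>);

-- ... then pull the full article data from the first table, per gtin.
SELECT * FROM articles
WHERE organization_id = <organization_id> AND gtin = <gtin>;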

HTH.

Roman Tumaykin
  • Thanks @Roman, the first part is helpful and I had tried that, but prior to the ticket change you posted. As for the second part, other devices are syncing with pagination (sending `last_sync_date`, `skip` and `count`), so I need a table with all article information, sorted by `last_updated`, without duplicates and searchable by `last_updated`. – Nenad Bozic Mar 19 '15 at 07:42
  • Then maybe it is even more important to preserve consistency during the pagination. Imagine you fetch 100 rows, and before you fetch the next 100 you delete 10 of the previously retrieved rows because those articles have been updated. Your next fetch will then pull what would previously have been rows 111 through 210, essentially never sending updates for rows 101-110. – Roman Tumaykin Mar 20 '15 at 00:02