We have a data model for articles with a lot of properties. Here is our table definition:
CREATE TABLE articles (
    organization_id bigint,
    gtin text,
    barcodes text,
    code text,
    brand text,
    season text,
    name text,
    option text,
    style text,
    color text,
    sizes text,
    supplier text,
    category text,
    prices text,
    last_updated timeuuid,
    content_hash uuid,
    markdown boolean,
    PRIMARY KEY (organization_id, gtin)
) WITH COMMENT='Articles';
Here gtin uniquely identifies an article, and we store all articles of an organization in one partition. We have a constraint to update an article only if something has actually changed. This is important because when an article changes we update its last_updated field, and external devices use that field to know which articles to synchronize, since they remember when they last synchronized.
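For illustration, the write path for a changed article looks roughly like this (a minimal sketch; the column list is abbreviated, the ? placeholders stand for bound values, and now() is the CQL function that generates a fresh timeuuid):

-- Upsert the changed article and stamp a new timeuuid
UPDATE articles
SET name = ?, prices = ?, content_hash = ?, last_updated = now()
WHERE organization_id = ? AND gtin = ?;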
We added one more table for that:
CREATE TABLE articles_by_last_updated (
    organization_id bigint,
    gtin text,
    barcodes text,
    code text,
    brand text,
    season text,
    name text,
    option text,
    style text,
    color text,
    sizes text,
    supplier text,
    category text,
    prices text,
    last_updated timeuuid,
    content_hash uuid,
    markdown boolean,
    PRIMARY KEY (organization_id, last_updated)
) WITH CLUSTERING ORDER BY (last_updated ASC) AND COMMENT='Articles by last updated field';
So we can easily return all articles updated after a certain point in time. This table must be kept free of duplicate rows per gtin: we import articles every day and sync is done from mobile devices, so we want to keep the dataset small. (In theory we could save everything in that table and overwrite with the latest info, but that created large datasets between syncs, so we started deleting from it; and to delete we need to know the old last_updated value from the first table.)
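Concretely, keeping one row per gtin means doing something like this on every article change (a sketch of the flow described above; the old last_updated read in step 1 is bound into the DELETE in step 2):

-- 1. Look up the article's current last_updated in the master table
SELECT last_updated FROM articles
WHERE organization_id = ? AND gtin = ?;

-- 2. Replace the stale row with the new one (the DELETE is what produces tombstones)
BEGIN BATCH
  DELETE FROM articles_by_last_updated
  WHERE organization_id = ? AND last_updated = ?;  -- old value from step 1
  INSERT INTO articles_by_last_updated (organization_id, gtin, name, content_hash, last_updated)
  VALUES (?, ?, ?, ?, now());  -- remaining columns omitted for brevity
APPLY BATCH;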
Problems we are facing right now:

- To check whether article fields changed we need to do a read before write. We partially solved that with the content_hash field, which is a hash over all fields: we read it and compare it with the hash of the incoming article (see the sketch after this list).
- We are deleting and inserting in the second table, since we need unique gtins there (we only need the latest change to send to devices, not duplicate articles), which produces an awful lot of tombstones.
- We have a feature to add: search by many different combinations of fields.
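The read-before-write check from the first point boils down to this (a sketch; the actual comparison happens in application code):

-- Fetch only the stored hash, not the whole article
SELECT content_hash FROM articles
WHERE organization_id = ? AND gtin = ?;
-- Application code then compares it with the hash of the incoming article
-- and issues the UPDATE (plus the second-table delete/insert) only if they differ.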
Questions:

- Is Cassandra a good choice for this kind of data, or should we move it to some other storage? (Or even combine Elasticsearch and Cassandra, where Elasticsearch indexes changes over time and Cassandra holds only the master data per gtin.)
- Can the data be modeled better for our use case, to avoid the read before write or the deletes in the second table?
Update

Just to clarify the use case: the devices sync with pagination (sending last_sync_date, skip and count), so we need a table with all article information, sorted by last_updated, without duplicates per gtin, and queryable by last_updated.
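The sync query then looks roughly like this (a sketch; maxTimeuuid() turns the device's last_sync_date into a timeuuid bound, and since CQL has no OFFSET, the skip part has to be emulated with paging state or client-side skipping):

-- All articles updated since the device last synced, oldest first
SELECT gtin, name, prices, last_updated  -- plus the remaining columns
FROM articles_by_last_updated
WHERE organization_id = ?
  AND last_updated > maxTimeuuid(?)  -- bound to last_sync_date
LIMIT ?;                             -- bound to count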