
Background

We chose Cassandra as our storage engine because we have an application that must handle asynchronous messaging between many users on the website and event storing (some types of analytics: what happens on the site and when, etc.). We also have a voting platform, so we store votes per user per day, and Cassandra is a good fit for those use cases.

Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational): political candidates with lists of jobs, education, voting history, endorsements, etc.

Problem

We have relations which can be edited on both ends (e.g. a candidate is supported by companies, but in our admin panel a company can be edited independently of any candidate). A candidate is one row in our Cassandra DB, identified by a UUID. On the front end we need the full information about a candidate (political party, schools, jobs, voting history, supporting companies). We want to place the majority of candidate info in a single row so we can read the data with a single read. However, when we place the list of supporting companies in the candidate row as a UDT list, we have problems editing it (we need to change it both in the company_by_id table and in the candidate_by_id table).
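To make the dual-edit problem concrete, here is a minimal sketch in Scala of what the two shapes might look like in application code. The names (Company, Candidate, supportingCompanies) are hypothetical and only mirror the company_by_id and candidate_by_id tables described above; the point is that the company data lives in two places, so renaming a company means rewriting every candidate row that embeds a copy of it.

```scala
import java.util.UUID

// Hypothetical mirror of company_by_id: the company's own row.
final case class Company(id: UUID, name: String)

// Hypothetical mirror of candidate_by_id: one wide row per candidate,
// with supporting companies embedded as a UDT-like list (copies, not references).
final case class Candidate(
  id: UUID,
  name: String,
  politicalParty: String,
  supportingCompanies: List[Company]
)

// Renaming a company touches two places: the company row itself and the
// embedded copy inside every candidate that lists it as a supporter.
def renameCompany(company: Company,
                  newName: String,
                  candidates: List[Candidate]): (Company, List[Candidate]) = {
  val updated = company.copy(name = newName)
  val patched = candidates.map { c =>
    c.copy(supportingCompanies =
      c.supportingCompanies.map(sc => if (sc.id == company.id) updated else sc))
  }
  (updated, patched)
}
```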

Question

How do we solve the editing problem and the relational-model issues in our situation?

We came up with a couple of solutions:

  1. Track relations in Cassandra with additional index-like tables, e.g. candidates_by_supporting_company. When updating a company, we also update the candidates who have that company (a sketch of this pattern follows the list).
  2. Similar to 1, but using a secondary index if the relation has low cardinality, and updating based on that secondary index (we have 10 political parties, so we can place an index on political party in the candidates table, and when a political party changes we can update its candidates via the index).
  3. Use a relational database for the relational type of data and leave Cassandra to handle only the use cases it suits, like time-series data, messaging and event storing (this adds the maintenance cost of one more database, plus deployment costs, and since our system is distributed it raises the question of how to replicate data between the two databases).
  4. Use Spark to do joins (this would not be the sole purpose of adding Spark to the system; we are already thinking of adding it for importing huge CSV data sets and doing transformations, so having Spark would be an added bonus, and we could use Spark SQL in the places where we need joins).
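For option 1, here is a minimal sketch of the update path, assuming a candidates_by_supporting_company index table keyed by company_id, and assuming the candidate row keeps its supporting company names in a map<uuid, text> column (a hypothetical simplification of the UDT list, since CQL lets a map entry be overwritten in place). All table and column names are assumptions; the DataStax Java driver calls are standard ones.

```scala
import java.util.UUID
import scala.jdk.CollectionConverters._
import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.cql.SimpleStatement

object CompanyRename {
  // Rename a company and propagate the change to every candidate row
  // that embeds it, using the index-like table from option 1.
  def renameCompany(session: CqlSession, companyId: UUID, newName: String): Unit = {
    // 1. find the candidates that reference this company
    val candidateIds = session
      .execute(SimpleStatement.newInstance(
        "SELECT candidate_id FROM candidates_by_supporting_company WHERE company_id = ?",
        companyId))
      .asScala
      .map(_.getUuid("candidate_id"))
      .toList

    // 2. update the company's own row
    session.execute(SimpleStatement.newInstance(
      "UPDATE company_by_id SET name = ? WHERE id = ?", newName, companyId))

    // 3. rewrite the denormalized copy in each affected candidate row
    //    (supporting_companies assumed to be map<uuid, text>)
    candidateIds.foreach { candidateId =>
      session.execute(SimpleStatement.newInstance(
        "UPDATE candidate_by_id SET supporting_companies[?] = ? WHERE id = ?",
        companyId, newName, candidateId))
    }
  }
}
```

Option 2 would look the same except that step 1 becomes a query against the secondary-indexed column (e.g. SELECT id FROM candidate_by_id WHERE political_party = ?) instead of maintaining the extra table.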

We are leaning towards option 4, since we will add Spark anyway, we will stay with only the Cassandra database (which avoids complicating maintenance and deployment with one more database), and with it we get a sort of efficient JOIN and GROUP BY at the application level, as sketched below.
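For the Spark route, a minimal sketch of an application-level join via the spark-cassandra-connector; the keyspace name elections and the column names are assumptions, and a current Spark version is assumed.

```scala
import org.apache.spark.sql.SparkSession

object CandidateJoins {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("candidate-joins")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Load Cassandra tables as DataFrames via the spark-cassandra-connector.
    def cassandraTable(table: String) = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "elections", "table" -> table))
      .load()

    cassandraTable("candidate_by_id").createOrReplaceTempView("candidates")
    cassandraTable("candidates_by_supporting_company").createOrReplaceTempView("support")
    cassandraTable("company_by_id").createOrReplaceTempView("companies")

    // The JOIN / GROUP BY that plain CQL cannot express.
    val supportersPerCandidate = spark.sql("""
      SELECT c.id, c.name, count(co.id) AS supporting_companies
      FROM candidates c
      JOIN support   s  ON s.candidate_id = c.id
      JOIN companies co ON co.id = s.company_id
      GROUP BY c.id, c.name
    """)

    supportersPerCandidate.show()
    spark.stop()
  }
}
```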

What do you think?


1 Answer


If you want to use only Cassandra, the right way to proceed is number 1: denormalization. But if you have a lot of relationships it will bring a lot of effort at the application level. If adding another DBMS is not a problem in your environment, using the right tool for the right job is the best choice: number 3 for me.

  • Do you maybe have an opinion on Spark? We have to add it anyway because of imports of huge historical CSV files from external sources, so we might already have it in the project to aid us with JOINs and sorting. – Nenad Bozic Sep 16 '15 at 06:06
  • Sure, if you need joins in a batch environment, go with Spark without thinking. If you need joins with Spark in an interactive environment, something like a JDBC connection from a business intelligence tool, you can leverage Spark's Thrift server: you can use caching and you will achieve good performance, but operations will suck a bit because the Thrift server is not production ready in my opinion. – axlpado - Agile Lab Sep 16 '15 at 20:12
  • Another viable option is to perform the denormalization in Spark in a batch way and persist a kind of materialized view as Parquet files over CFS. This will allow SQL interaction with good performance without caching everything in memory (see the sketch after these comments). – axlpado - Agile Lab Sep 16 '15 at 20:14
  • Thanks for the explanation. I was thinking of usage in an interactive application, to sort of add relational features to the Cassandra cluster in the few places where we need them (administrators want to have everything in one view, and we have mostly optimized queries and tables for the front-end app, so this does not have to be super fast). – Nenad Bozic Sep 17 '15 at 06:33
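Following up on the batch-denormalization idea in the comments, a minimal sketch: join once in a scheduled batch job and persist the result as Parquet, then serve the interactive reads from that materialized view instead of joining the live Cassandra tables on every request. The output path, keyspace and table/column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object CandidateViewBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("candidate-view-batch")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    def cassandraTable(table: String) = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "elections", "table" -> table))
      .load()

    val candidates = cassandraTable("candidate_by_id")
    val companies  = cassandraTable("company_by_id")
    val support    = cassandraTable("candidates_by_supporting_company")

    // Alias columns so the joined view has unambiguous names.
    val cand = candidates.select(
      candidates("id").as("candidate_id"),
      candidates("name").as("candidate_name"))
    val comp = companies.select(
      companies("id").as("company_id"),
      companies("name").as("company_name"))

    // Denormalize once per batch run instead of on every interactive query.
    val view = cand
      .join(support, Seq("candidate_id"))
      .join(comp, Seq("company_id"))

    // Persist the materialized view; later queries read the Parquet files
    // (on CFS, HDFS or local disk) instead of re-joining Cassandra tables.
    view.write.mode("overwrite").parquet("/views/candidate_with_companies")

    spark.stop()
  }
}
```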