
I have to compare two tables in Cassandra to get the differences. Here is the requirement: we perform an inventory count in which we enter or scan every item in stock, and when the count is finished we compare the results with the master inventory table to get the variance. I created a temp table in Cassandra into which I insert a record for each scan.

**TempInventory**

userId
storeId
skuId
PK(storeId, skuId)

I also have a master table with other details:

**Inventory**

storeId
skuId
skuDesc
..
..
PK(storeId)

Once the scan is complete, on submit I have to compare the TempInventory table with the Inventory table to get the differences. What is the best way of doing this in Cassandra, given that we cannot use joins?

  1. Load everything into a collection of Java objects and compare there (using Java 8 features for better performance). In this case the Inventory table may hold more than 3000 rows. Is it fine to load everything into the JVM?
  2. Use Spark SQL with Cassandra, which allows joins. (Spark is new to me, so I do not have a good idea of how to use it. Links to examples would be helpful.)
  3. Is there any other utility available (e.g. from Apache)?
  4. I am also using GemFire, but I think we cannot create a region in GemFire with a composite key. Please correct me if I am wrong.

Please suggest which approach is most suitable.

Saurabh

1 Answer


Correct, Cassandra does not offer any built-in mechanism to compare two tables. You need to do it yourself.

A first suggestion would be to use the same primary key in both tables. Do you need to add skuId to the PK of your temp table? Doing so would make fetching the data to compare difficult.

I would say the answer depends on the amount of data you need to process. If you have a large amount (hundreds of GBs or more), it would be worth using Spark or Storm to process it. If you don't have that much, you can use a simple Java program. It might take a while to complete, but you would not have to set up Spark or Storm.
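For the simple Java route, the comparison itself can be a pair of set differences over the (storeId, skuId) keys. The following is only a minimal sketch, written to compile on Java 8: the class `InventoryVariance`, the key type `SkuKey`, and the method `difference` are hypothetical names for illustration, and the sample data in `main` stands in for rows you would read from both tables with whatever Cassandra driver you already use. A few thousand keys should fit comfortably in memory.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class InventoryVariance {

    /** Composite key (storeId, skuId); a hypothetical helper type, not from the original post. */
    static final class SkuKey {
        final String storeId;
        final String skuId;

        SkuKey(String storeId, String skuId) {
            this.storeId = storeId;
            this.skuId = skuId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SkuKey)) return false;
            SkuKey other = (SkuKey) o;
            return storeId.equals(other.storeId) && skuId.equals(other.skuId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(storeId, skuId);
        }

        @Override
        public String toString() {
            return storeId + "/" + skuId;
        }
    }

    /** Returns the keys present in left but absent from right; the inputs are not modified. */
    static Set<SkuKey> difference(Set<SkuKey> left, Set<SkuKey> right) {
        Set<SkuKey> result = new HashSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Sample data standing in for rows read from the Inventory and TempInventory tables.
        Set<SkuKey> master = new HashSet<>(Arrays.asList(
                new SkuKey("S1", "A"), new SkuKey("S1", "B"), new SkuKey("S1", "C")));
        Set<SkuKey> scanned = new HashSet<>(Arrays.asList(
                new SkuKey("S1", "A"), new SkuKey("S1", "C"), new SkuKey("S1", "D")));

        System.out.println("In master but not scanned: " + difference(master, scanned));
        System.out.println("Scanned but not in master: " + difference(scanned, master));
    }
}
```

Hash-based sets keep each difference roughly linear in the number of keys, so this avoids comparing every scanned row against every master row.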

Gabe Thorns
Thanks. Multiple users can scan, and they can scan the same SKU, so we don't want duplicate rows for the same SKU. This is why skuId is part of the PK. – Saurabh Aug 22 '17 at 21:21