
I am planning to store high-volume order transaction records from a commerce website in a repository (we have to use Cassandra here; that is our DB). Let us call this component commerceOrderRecorderService.

The second part of the problem is that I want to process these orders and push them to other downstream systems. This component can be called batchCommerceOrderProcessor.

Both commerceOrderRecorderService and batchCommerceOrderProcessor will run on a Java platform.

I need suggestions on the design of these components, especially the points below:

commerceOrderRecorderService

  1. What is the best way to design the columns, considering performance and scalability? Should I store the entire order (a complex entity) as a single JSON object? There is no search requirement on the order attributes; searching can at least wait until the orders have been processed by the batch processor. Consider that a single order can contain many sub-items, each of which may be fulfilled differently at processing time. Designing columns for such a data structure may be overkill. (A sketch of what I have in mind follows this list.)

  2. What should the key be, given that data volumes will be high (say 10 transactions per second at peak)? Are there any libraries or best practices for writing such transactional data to Cassandra? Can TTL also be used effectively?
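For concreteness, here is a rough sketch of the kind of blob-style table and insert I have in mind, written against the DataStax Java driver (keyspace, table, column names, and the TTL value are all placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class OrderRecorderSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("commerce");

        // Entire order kept as one opaque JSON payload; no per-attribute columns.
        session.execute("CREATE TABLE IF NOT EXISTS order_record ("
                + "order_id   text PRIMARY KEY,"   // unique order ID from the website
                + "created_at timestamp,"
                + "payload    text)");             // full order serialized as JSON

        // A TTL (30 days here) would let rows expire without an explicit purge job.
        PreparedStatement insert = session.prepare(
                "INSERT INTO order_record (order_id, created_at, payload) "
                + "VALUES (?, ?, ?) USING TTL 2592000");

        session.execute(insert.bind("ORD-12345", new java.util.Date(),
                "{\"items\":[{\"sku\":\"A1\",\"qty\":2}]}"));

        cluster.close();
    }
}
```

The 30-day TTL is only an example; it would need to be at least as long as the maximum lag before the batch processor picks an order up.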

batchCommerceOrderProcessor

  1. How should the rows be retrieved for processing?
  2. How can we ensure that a multi-threaded implementation of the batch processor (which would potentially run on multiple nodes as well) has row-level isolation, i.e. that no two instances read and process the same row at the same time, with no duplicate processing?
  3. How do we purge the data after a certain period of time, while being friendly to Cassandra processes like compaction?

Appreciate design inputs, code samples and pointers to libraries. Thanks.

Santanu Dey
  • Given how easy it is to install a database server, and given how specific a type of database server Cassandra is, I think your motivation for choosing Cassandra ("that's our DB") is wrong. – flup Jan 20 '14 at 11:26
  • @flup, it is a design constraint, if you like, based on legacy. Feel free to throw more light on what you think would make sense. I was really hoping for inputs within the given constraints. – Santanu Dey Jan 20 '14 at 13:11
  • What I mean to say is, don't choose a NoSQL database just because you already have one in place, but for instance because you need the scalability. Question about the orders: can you give a more functional description of what the system needs to do? What does a sample order look like? Where does it go? The way I read you now, each order gets shredded into order lines that get distributed across different systems. Is this correct? If so, what happens next, and does the system have any other responsibilities, such as combining the status of the distributed lines back into an order status? – flup Jan 20 '14 at 21:06
  • Orders will flow to an Order Management System, get persisted in an RDBMS, etc. But what happens downstream to the orders is not really relevant to this problem. As far as the commerceOrderRecorderService is concerned, it just reliably persists the orders and checks for duplicate order IDs before persisting. It does not even care about the composition of the order items, their value, etc. It is completely agnostic to the payload. – Santanu Dey Jan 23 '14 at 14:38

2 Answers


Depending on the overall requirements of your system, it could be feasible to employ an architecture composed of:

  1. Cassandra to store the orders, analytics and what have you.
  2. Message queue - your commerce order recorder service would simply enqueue each new order to a transactional, persistent queue and return. Scalability and performance should not be an issue here, as you can easily achieve thousands of transactions per second with a single queue server. You may have a look at RabbitMQ as one of the available choices; see the enqueue sketch after this list.
  3. Stream processing framework - you could read a stream of messages from the queue in a scalable fashion using a streaming framework such as Twitter Storm. You could then implement three simple pipelined processes in Java, wired together as in the topology sketch after this list:

    a) A spout process that dequeues the next order from the queue and passes it to the second process
    b) A second process, called a bolt, that inserts each order into Cassandra and passes it to the third bolt
    c) A third bolt process that pushes the order to the other downstream systems.
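For point 2, a minimal enqueue sketch against the RabbitMQ Java client (queue name, host, and payload are made-up placeholders; error handling omitted):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class OrderEnqueuer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        try {
            // durable = true: the queue definition survives a broker restart
            channel.queueDeclare("orders", true, false, false, null);
            String orderJson = "{\"orderId\":\"ORD-12345\",\"items\":[]}";
            // PERSISTENT_TEXT_PLAIN marks the message itself as persistent
            channel.basicPublish("", "orders",
                    MessageProperties.PERSISTENT_TEXT_PLAIN,
                    orderJson.getBytes("UTF-8"));
        } finally {
            channel.close();
            conn.close();
        }
    }
}
```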
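For point 3, a sketch of how the three pipelined processes could be wired together in Storm. All class, stream, and field names are placeholders, and the actual RabbitMQ dequeue, Cassandra insert, and downstream calls are left as comments:

```java
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class OrderTopologySketch {

    // a) Spout: dequeues the next order JSON from the queue and emits it.
    public static class OrderQueueSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            // ... fetch the next order JSON from RabbitMQ, then:
            // collector.emit(new Values(orderJson));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("order"));
        }
    }

    // b) First bolt: insert the order into Cassandra, then pass it on.
    public static class CassandraWriterBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String orderJson = input.getStringByField("order");
            // ... insert orderJson into Cassandra here ...
            collector.emit(new Values(orderJson));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("order"));
        }
    }

    // c) Second bolt: push the order to the downstream systems.
    public static class DownstreamBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            // ... call downstream systems with input.getStringByField("order") ...
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("order-spout", new OrderQueueSpout(), 1);
        builder.setBolt("cassandra-writer", new CassandraWriterBolt(), 4)
               .shuffleGrouping("order-spout");
        builder.setBolt("downstream-pusher", new DownstreamBolt(), 4)
               .shuffleGrouping("cassandra-writer");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("order-pipeline", conf, builder.createTopology());
    }
}
```

One caveat: if the spout emits tuples with message IDs, Storm replays failed tuples, which gives at-least-once processing, so the downstream bolt should tolerate occasional duplicates.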

Such an architecture offers high performance, scalability, and near-real-time, low-latency data processing. It takes into account that Cassandra is very strong at high-speed data writes, but not so strong at reading sequential lists of records. We use the Storm+Cassandra combination in our InnoQuant MOCA platform and handle 25,000 tx/second and more, depending on hardware.

Finally, you should consider whether such an architecture might be overkill for your scenario. Nowadays, you can easily achieve 10 tx/second with nearly any single-box database.

OlegM

This example may help a little. It loads a lot of transactions using the jmxbulkloader and then batches the results into files of a certain size to be transported elsewhere. It is multi-threaded, but within the same process.

https://github.com/PatrickCallaghan/datastax-bulkloader-writer-example
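Not the repo's actual code, but the general shape of the multi-threaded write path looks roughly like this, assuming the DataStax Java driver (table, keyspace, and names are placeholders):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BulkWriterSketch {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        // Session is thread-safe, so all writer threads can share this one.
        final Session session = cluster.connect("commerce");
        final PreparedStatement insert = session.prepare(
                "INSERT INTO order_record (order_id, created_at, payload) VALUES (?, ?, ?)");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100000; i++) {
            final String orderId = "ORD-" + i;
            pool.submit(new Runnable() {
                public void run() {
                    session.execute(insert.bind(orderId, new java.util.Date(), "{}"));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        cluster.close();
    }
}
```

For higher throughput the driver also offers session.executeAsync(...), which avoids blocking a thread per write.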

Hope it helps. BTW, it uses the latest Cassandra 2.0.5.