0

Overview

I would like to determine the correct schema in cassandra for financial tick data.

Data and schema

I have the following sample data in csv:

SYMBOL,DATE,TIME,PRICE,SIZE
A,2011-01-03,9:28:00,41.46,200
A,2011-01-03,9:28:00,41.46,100
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.55,100
A,2011-01-03,9:30:19,41.55,200
A,2011-01-03,9:30:23,41.5169,100
A,2011-01-03,9:30:29,41.44,66534
A,2011-01-03,9:30:29,41.45,225
A,2011-01-03,9:30:30,41.44,100
A,2011-01-03,9:30:30,41.43,100
A,2011-01-03,9:30:30,41.49,100
A,2011-01-03,9:30:30,41.45,200

and I store into the following table:

CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
PRIMARY KEY ((symbol,date),time)
);

This a slice of a SELECT of the table:

 symbol | date       | time               | price   | size
--------+------------+--------------------+---------+-------
      A | 2011-01-03 | 09:28:00.000000000 |   41.46 |   100
      A | 2011-01-03 | 09:30:00.000000000 |   41.56 |  1300
      A | 2011-01-03 | 09:30:19.000000000 |   41.55 |   200
      A | 2011-01-03 | 09:30:23.000000000 | 41.5169 |   100
      A | 2011-01-03 | 09:30:29.000000000 |   41.45 | 66534

Use case

The data will be written to Cassandra once, and mostly read with conditions on date and symbol, e.g. a set of symbols for a given time-period.

Questions

  • The tuple (symbol,date,time) are not a proper PRIMARY KEY, since my granularity is limited to seconds. Hence, the COPY FROM e.g. drops the second row of the csv during the import due to the repetition in the key. How can I preserve the record?

  • Assuming the PRIMARY KEY is unique, how can I avoid storing repeated values of SYMBOL and DATE? Or is partitioning taking care of that under the hood?

  • I was thinking to use the following schema:

    CREATE TABLE tickdata (
    symbol text,
    date date,
    time blob,
    price blob,
    size blob,
    PRIMARY KEY ((symbol,date))
    );
    

    to store raw data. Is this the correct way to address the points above?

  • The data is NOT ordered according to the definition of the PRIMARY KEY when I SELECT it. Is that related to the non-uniqueness problem entioned above?

  • Should I stick with my binary file-store which keeps a map of symbols and dates and loads the relevant files on request? This avoids repeating symbol and date for each row and is indifferent to limited granularity (repetition) of the timestamp.

Oleg
  • 10,406
  • 3
  • 29
  • 57

1 Answers1

3

The tuple (symbol,date,time) are not a proper PRIMARY KEY, since my granularity is limited to seconds. Hence, the COPY FROM e.g. drops the second row of the csv during the import due to the repetition in the key. How can I preserve the record?

The primary key in your first table definition is ((symbol,date),time) NOT (symbol,date,time). Both are different in cassandra.

((symbol,date),time) => will store all records for same symbol (A) and date in one node. For same symbol(A) but other date might go on other node. Row Key will be symbol+date

Physical Data layout (example)

|A_2011-01-03||time1.price & time1.value||time2.price & time2.value|
|A_2011-01-04||time1.price & time1.value||time2.price & time2.value|
|B_2011-01-03||time1.price & time1.value||time2.price & time2.value|
|B_2011-01-04||time1.price & time1.value||time2.price & time2.value|

(symbol,date,time) => All records for same symbol will reside on one node. This might result in wide rows. Row key will be symbol.

Physical Data layout (example)

|A||date1.time1.price & date1.time1.value||date1.time2.price & date1.time2.value||date2.time1.price & date2.time1.value||date2.time2.price & date2.time2.value|
|B||date1.time1.price & date1.time1.value||date1.time2.price & date1.time2.value||date2.time1.price & date2.time1.value||date2.time2.price & date2.time2.value|

To Avoid dropping of records you can add one more column like uuid or timeuuid

CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
id timeuuid
PRIMARY KEY ((symbol,date),time,id)
);

Assuming the PRIMARY KEY is unique, how can I avoid storing repeated values of SYMBOL and DATE? Or is partitioning taking care of that under the hood?

Based on physical storage structure explained above this issue is already taken care of.

The alternate schema you are talking about will have only 1 record for one symbol and a date. You will have to handle the blob at application side... which i think might be overhead.

The data is NOT ordered according to the definition of the PRIMARY KEY when I SELECT it. Is that related to the non-uniqueness problem entioned above?

By default data is ordered by clustering key in ascending order (in your case time). Though you can change order by changing CLUSTERING ORDER BY property of table to descending.

Example:

CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
id timeuuid
PRIMARY KEY ((symbol,date),time,id)
) WITH CLUSTERING ORDER BY(time desc,id desc); 

Should I stick with my binary file-store which keeps a map of symbols and dates and loads the relevant files on request? This avoids repeating symbol and date for each row and is indifferent to limited granularity (repetition) of the timestamp.

you can decide this on your own :)

undefined_variable
  • 6,180
  • 2
  • 22
  • 37
  • Great answer, thanks! I tried to google more on the data layout but is there any resource on the matter that you consider a must read? – Oleg May 09 '17 at 18:54
  • To answer my question, a very useful read is: [Understanding How CQL3 Maps to Cassandra’s Internal Data Structure](http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/) – Oleg May 11 '17 at 12:21
  • This answer is relevant material as well: http://stackoverflow.com/a/27277449/2180721 – Oleg May 18 '17 at 12:58