Overview
I would like to determine the correct Cassandra schema for financial tick data.
Data and schema
I have the following sample data in csv:
SYMBOL,DATE,TIME,PRICE,SIZE
A,2011-01-03,9:28:00,41.46,200
A,2011-01-03,9:28:00,41.46,100
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.56,1300
A,2011-01-03,9:30:00,41.55,100
A,2011-01-03,9:30:19,41.55,200
A,2011-01-03,9:30:23,41.5169,100
A,2011-01-03,9:30:29,41.44,66534
A,2011-01-03,9:30:29,41.45,225
A,2011-01-03,9:30:30,41.44,100
A,2011-01-03,9:30:30,41.43,100
A,2011-01-03,9:30:30,41.49,100
A,2011-01-03,9:30:30,41.45,200
and I store it in the following table:
CREATE TABLE tickdata (
symbol text,
date date,
time time,
price float,
size int,
PRIMARY KEY ((symbol,date),time)
);
This is a slice of a SELECT from the table:
symbol | date | time | price | size
--------+------------+--------------------+---------+-------
A | 2011-01-03 | 09:28:00.000000000 | 41.46 | 100
A | 2011-01-03 | 09:30:00.000000000 | 41.56 | 1300
A | 2011-01-03 | 09:30:19.000000000 | 41.55 | 200
A | 2011-01-03 | 09:30:23.000000000 | 41.5169 | 100
A | 2011-01-03 | 09:30:29.000000000 | 41.45 | 66534
Use case
The data will be written to Cassandra once, and mostly read with conditions on date
and symbol
, e.g. a set of symbols for a given time-period.
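The intended read pattern maps directly onto the partition key of the table above; a sketch of such a query, using values from the sample data:

```sql
-- Read one symbol for one day: a single-partition read.
SELECT * FROM tickdata
WHERE symbol = 'A'
  AND date = '2011-01-03';
```

A set of symbols for a period then becomes one such partition read per (symbol, date) pair, issued via IN on the partition key or as parallel queries.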
Questions
The tuple (symbol, date, time) is not a proper PRIMARY KEY, since my granularity is limited to seconds. Hence COPY FROM, for example, drops the second row of the CSV during import due to the repetition in the key. How can I preserve those records?

Assuming the PRIMARY KEY is unique, how can I avoid storing repeated values of SYMBOL and DATE? Or does partitioning take care of that under the hood?

I was thinking of using the following schema to store raw data:

CREATE TABLE tickdata (
    symbol text,
    date date,
    time blob,
    price blob,
    size blob,
    PRIMARY KEY ((symbol,date))
);

Is this the correct way to address the points above?
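One common way to keep colliding rows without giving up the clustered layout is to widen the clustering key with a tiebreaker assigned at load time. A sketch of that variant; the seq column is my own addition, not part of the original data, and COPY FROM cannot generate it, so the import would need a small client-side loader:

```sql
CREATE TABLE tickdata (
    symbol text,
    date date,
    time time,
    seq int,      -- hypothetical per-second sequence number assigned during import
    price float,
    size int,
    PRIMARY KEY ((symbol, date), time, seq)
);
```

With this key, two trades in the same second get seq 0, 1, ... and both survive; rows still cluster by time within each (symbol, date) partition.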
The data is NOT ordered according to the definition of the PRIMARY KEY when I SELECT it. Is that related to the non-uniqueness problem mentioned above?

Should I stick with my binary file store, which keeps a map of symbols and dates and loads the relevant files on request? It avoids repeating symbol and date for each row and is indifferent to the limited granularity (repetition) of the timestamp.
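For reference, the blob-based schema sketched above would need the client to pack each day's ticks into byte strings before writing. A minimal sketch of one possible encoding, using Python's struct module; the field layout (time as seconds since midnight, price as float64, size as uint32) is an assumption, not something the schema dictates:

```python
import struct

def pack_ticks(ticks):
    """Pack (time, price, size) tuples into three parallel blobs.

    time: seconds since midnight as big-endian uint32
    price: big-endian float64
    size: big-endian uint32
    """
    times = b"".join(struct.pack(">I", t) for t, _, _ in ticks)
    prices = b"".join(struct.pack(">d", p) for _, p, _ in ticks)
    sizes = b"".join(struct.pack(">I", s) for _, _, s in ticks)
    return times, prices, sizes

def unpack_ticks(times, prices, sizes):
    """Invert pack_ticks: rebuild the list of (time, price, size) tuples."""
    n = len(times) // 4
    return [
        (struct.unpack_from(">I", times, 4 * i)[0],
         struct.unpack_from(">d", prices, 8 * i)[0],
         struct.unpack_from(">I", sizes, 4 * i)[0])
        for i in range(n)
    ]

# Duplicate ticks survive the round trip, unlike with the clustered table.
ticks = [(34080, 41.46, 200), (34080, 41.46, 100), (34200, 41.56, 1300)]
t, p, s = pack_ticks(ticks)
assert unpack_ticks(t, p, s) == ticks
```

Because the blobs are opaque to Cassandra, repeated timestamps are preserved, but you lose server-side filtering on time and must read the whole partition per (symbol, date).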