CQL execution [returns instantly; I assume it uses the clustering key index]:

cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';

 count
-------
  5447

Presto execution [takes around 8 seconds]:

presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
  c   
------
 5447 
(1 row)

Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]

Why does Presto process 147K rows when Cassandra itself returns just 5447 rows for the same query? [I tried select * too.]

Why is Presto not able to use the clustering key optimization?

I tried every value format I could think of (timestamp, date, different date formats), but none of them had any effect on the number of rows fetched.

CF Reference:

CREATE TABLE events (
  month text,
  day timestamp,
  test_data text,
  some_random_column text,
  event_time timestamp,
  PRIMARY KEY (month, day, event_time)
)  WITH comment='Test Data'
AND read_repair_chance = 1.0;

I also added event_time as a constraint, in response to Dain's answer:

presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
 _col0 
-------
     1 
(1 row)

Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
Tamil

1 Answer

The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is: why does the Cassandra connector not take advantage of this? To see why, we'll have to look at the code.

The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
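
To make the described behavior concrete, here is a minimal Python sketch of an "every primary key column must be constrained" check. It is only a model of what the paragraph above describes, not the connector's actual Java code; the function name can_push_down and the dict-based representation of constraints are assumptions made for illustration.

```python
# Hypothetical model of the described check; not the actual Presto connector code.
PRIMARY_KEY = ["month", "day", "event_time"]  # from the CREATE TABLE in the question


def can_push_down(constraints, primary_key=PRIMARY_KEY):
    """Return True only if every primary key column has a constraint."""
    return all(column in constraints for column in primary_key)


# Constraints from the original query: event_time is missing.
question_query = {"month": "2015-04", "day": "2015-04-02"}
# The same query with event_time added.
full_key_query = {**question_query, "event_time": "2015-04-02 00:00:34+0000"}

print(can_push_down(question_query))  # False
print(can_push_down(full_key_query))  # True
```

Under this model, the original query falls back to an unconstrained scan because event_time is missing.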

I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.
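
One way such a relaxed rule could look, sketched in Python rather than the connector's Java: require the partition key, then push down clustering columns only up to the first unconstrained one, since CQL expects clustering columns to be restricted in their declared order. The name pushable_columns and the dict-based constraints are hypothetical and do not correspond to any Presto API.

```python
# Hypothetical sketch of a relaxed rule; names are illustrative, not Presto APIs.
PARTITION_KEY = ["month"]                # from the CREATE TABLE in the question
CLUSTERING_KEY = ["day", "event_time"]   # clustering columns, in declared order


def pushable_columns(constraints):
    """Return the constrained columns that could go into the CQL WHERE clause."""
    # Every partition key column must be constrained, or nothing is pushed down.
    if not all(column in constraints for column in PARTITION_KEY):
        return []
    pushed = list(PARTITION_KEY)
    # Clustering columns must be restricted in order, so stop at the first gap.
    for column in CLUSTERING_KEY:
        if column not in constraints:
            break
        pushed.append(column)
    return pushed


print(pushable_columns({"month": "2015-04", "day": "2015-04-02"}))
# ['month', 'day']
```

With this rule, the query from the question would push down both month and day even though event_time is unconstrained.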

Dain Sundstrom
  • I tried adding event_time as well as a constraint in my select query, but it didn't seem to help – Tamil Mar 01 '15 at 07:09
  • https://github.com/facebook/presto/issues/2341 Should I attribute this behavior to that issue? – Tamil Mar 01 '15 at 07:25
  • https://github.com/facebook/presto/pull/1520 Seems to be a resolution for the same issue – Tamil Mar 01 '15 at 07:45