0

I am testing CrateDB with a data set of 80 million events sent from a web app, both as a normalized, relational solution, and also as a denormalized, single database solution.

I imported all 80 million denormalized events into a table, and ran the following aggregation query:

select productName, SUM(elapsed)/60 as total_minutes from denormalized
where country_code = 'NL' AND eventType = 'mediaPlay' 
group by productName
order by total_minutes desc
limit 1000;

and the query took .009 seconds. Wowza! CrateDB is blazing fast!

Then I imported the sessionwide docs into one table called "sessions", and all the individual event docs in each session into another table called "events", and ran the following query:

select e.productName, SUM(e.elapsed)/60 as total_minutes from sessions s
join events e ON e.sessionGroup = s.sessionGroup
where s.country_code = 'NL' AND e.eventType = 'mediaPlay' 
group by e.productName
order by total_minutes desc
limit 1000;

which took 21 seconds.

My question is, is there any way to get faster relational performance, maybe by creating indexes, or changing the query somehow?

Tangential thought: We have been using Elasticsearch for analytics, obviously denormalizing the data, and it's plenty fast, but CrateDB seems to offer everything Elasticsearch does (fast queries on denormalized data, clustering, dynamic schema, full text search), plus the additional advantages of:

  • better SQL support
  • the option to deploy relational solutions on small data sets (wonderful to standardize on one DB, no context-switching or ramp up for developers who know SQL).
James Jensen
  • 255
  • 1
  • 2
  • 11

1 Answers1

0

What CrateDB version are you using? If it is < 3.0, than upgrading will probably boost the join query a lot, see https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three/.

Sebastian Utz
  • 719
  • 3
  • 9
  • It's version 3.0.5. – James Jensen Sep 04 '18 at 05:32
  • 1
    The join query will hit the same indices as your normal query (CrateDB is creating indices by default for each field) so there is nothing you could improve. As joins and aggregations/grouping can use all nodes in a cluster to join/aggregate/group in parallel, scaling out can be an option to improve the query speed. – Sebastian Utz Sep 12 '18 at 07:29
  • Thanks! Good info. – James Jensen Sep 20 '18 at 20:17