I am testing CrateDB with a data set of 80 million events sent from a web app, both as a normalized, relational solution and as a denormalized, single-table solution.
I imported all 80 million denormalized events into a table, and ran the following aggregation query:
SELECT productName, SUM(elapsed) / 60 AS total_minutes
FROM denormalized
WHERE country_code = 'NL' AND eventType = 'mediaPlay'
GROUP BY productName
ORDER BY total_minutes DESC
LIMIT 1000;
and the query took 0.009 seconds. Wowza! CrateDB is blazing fast!
Then I imported the session-wide docs into one table called "sessions", and all the individual event docs in each session into another table called "events", and ran the following query:
SELECT e.productName, SUM(e.elapsed) / 60 AS total_minutes
FROM sessions s
JOIN events e ON e.sessionGroup = s.sessionGroup
WHERE s.country_code = 'NL' AND e.eventType = 'mediaPlay'
GROUP BY e.productName
ORDER BY total_minutes DESC
LIMIT 1000;
which took 21 seconds.
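For reference, the relevant parts of the two tables look roughly like this (a simplified sketch showing only the columns used in the query; the types here are illustrative, not my exact DDL):

CREATE TABLE sessions (
  sessionGroup TEXT,   -- join key shared with events
  country_code TEXT
  -- ... other session-level fields
);

CREATE TABLE events (
  sessionGroup TEXT,   -- references the parent session
  eventType TEXT,
  productName TEXT,
  elapsed BIGINT       -- seconds, summed into minutes in the query above
  -- ... other event-level fields
);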
My question is: is there any way to get faster relational performance, perhaps by creating indexes or restructuring the query?
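For example, one rewrite I have been wondering about (same tables as above; I don't know whether CrateDB's planner already does this internally) is to filter sessions down to the matching country before the join, so the join only has to touch NL sessions:

SELECT e.productName, SUM(e.elapsed) / 60 AS total_minutes
FROM events e
JOIN (
  SELECT sessionGroup FROM sessions WHERE country_code = 'NL'
) s ON e.sessionGroup = s.sessionGroup
WHERE e.eventType = 'mediaPlay'
GROUP BY e.productName
ORDER BY total_minutes DESC
LIMIT 1000;

I haven't measured whether this actually helps; it's just an illustration of the kind of restructuring I mean.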
Tangential thought: we have been using Elasticsearch for analytics (obviously denormalizing the data), and it's plenty fast. But CrateDB seems to offer everything Elasticsearch does (fast queries on denormalized data, clustering, dynamic schema, full-text search), plus these additional advantages:
- better SQL support
- the option to deploy relational solutions on small data sets (it would be wonderful to standardize on one DB: no context-switching or ramp-up for developers who already know SQL).