
In my Crate.io database I have a table with currently 50 million rows and a size of 16 GB. If I try to get the number of entries per day with the following statement, everything works fine (except for the performance, but that should not be the problem at the moment):

SELECT
    date_format('%Y-%m-%d', date_trunc('day', "days")) AS "Day",
    count(*) AS "Count"
FROM "doc"."mytable"
WHERE
    date_format('%Y-%m-%d', date_trunc('day', "days")) BETWEEN date_format('%Y-%m-%d', date_trunc('day', current_timestamp + -2592000000))
    AND date_format('%Y-%m-%d', date_trunc('day', current_timestamp + -86400000))
GROUP BY date_format('%Y-%m-%d', date_trunc('day', "days"))
ORDER BY date_format('%Y-%m-%d', date_trunc('day', "days")) ASC
LIMIT 100;

But if I try to do a distinct count on another column, like this:

SELECT
    date_format('%Y-%m-%d', date_trunc('day', "days")) AS "Day",
    count(DISTINCT customerid) AS "Count"
FROM "doc"."mytable"
WHERE
    date_format('%Y-%m-%d', date_trunc('day', "days")) BETWEEN date_format('%Y-%m-%d', date_trunc('day', current_timestamp + -2592000000))
    AND date_format('%Y-%m-%d', date_trunc('day', current_timestamp + -86400000))
GROUP BY date_format('%Y-%m-%d', date_trunc('day', "days"))
ORDER BY date_format('%Y-%m-%d', date_trunc('day', "days")) ASC
LIMIT 100;

the statement will fail with

SQLActionException[CircuitBreakingException: [query] Data too large, data for [collect: 0] would be larger than limit of [1267571097/1.1gb]]

Does anybody have an idea why a COUNT(DISTINCT col) has a problem with too much data while a COUNT(*) does not? And how can I solve this issue?


1 Answer


A plain count is a very lightweight operation, because only a single numeric (long) value is needed, which is incremented for each row. For a distinct count, however, all values seen so far must be kept in memory in order to decide whether a value already exists (no increment) or is a new one (increment the counter).
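
A minimal sketch in plain Java (only an illustration of the two memory profiles, not CrateDB's actual implementation):

import java.util.HashSet;
import java.util.Set;

public class CountSketch {

    // COUNT(*): constant memory, a single long no matter how many rows.
    static long plainCount(long[] rows) {
        long counter = 0;
        for (long ignored : rows) {
            counter++; // one increment per row, no other state kept
        }
        return counter;
    }

    // COUNT(DISTINCT col): every distinct value must be remembered, so
    // memory grows with the number of distinct values, not the row count.
    static long distinctCount(long[] values) {
        Set<Long> seen = new HashSet<>();
        for (long v : values) {
            seen.add(v); // new value -> stored, duplicate -> ignored
        }
        return seen.size();
    }

    public static void main(String[] args) {
        long[] customerIds = {1L, 2L, 2L, 3L, 1L};
        System.out.println(plainCount(customerIds));    // 5
        System.out.println(distinctCount(customerIds)); // 3
    }
}

So with 50 million rows the plain count still needs just one long, while the distinct count has to keep up to one entry per distinct customerid in memory, which is what trips the circuit breaker.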

To get around the CircuitBreakingException (which, by the way, saves you from a stuck node; otherwise an OutOfMemoryError would be thrown and your node would be unusable), increase the heap for the Crate process. How to set the heap size varies depending on the distribution used; normally a CRATE_HEAP_SIZE environment variable can be used.

Increasing the heap could also give you better performance for your first GROUP BY query. A good rule is to give Crate 50% of the available memory (for example, a 16 GB heap on a machine with 32 GB of RAM), so the other 50% can be used by the OS file system cache, which Crate also benefits from.

  • Thanks for your answer. I understand that more memory will improve the performance, but the real question is: how much memory do I need if I have roughly 3 TB of data (which is the data volume we actually planned for this table)? Currently I am testing with only 16 GB on a small system. All tests so far show that we do not need much memory for storing data, but a lot of memory for querying data. But even that was much less than we needed with ES/Kibana, which was one of the reasons we wanted to use Crate. Now I am not really sure about this decision anymore. – Ragin May 11 '17 at 18:50
  • If the column type used for a distinct count is a fixed-size type, e.g. a long, you can multiply the number of distinct values by the bytes this type requires (e.g. see https://github.com/crate/crate/blob/master/core/src/main/java/io/crate/types/LongType.java#L150). That would give you a rough estimation. If it's a string type it will be much harder to estimate, as the bytes a string requires depend on the length of the string value. – Sebastian Utz May 15 '17 at 13:04
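
To make the estimation from the last comment concrete, here is a back-of-envelope sketch; the number of distinct values is an assumed example input, not a measurement:

public class DistinctMemoryEstimate {
    public static void main(String[] args) {
        // Assumed example: 50 million distinct customerid values of type long.
        long distinctValues = 50_000_000L;
        long bytesPerValue = 8L; // a long needs 8 bytes (see LongType linked above)
        long estimateBytes = distinctValues * bytesPerValue;
        System.out.printf("~%.1f MB before any hash table overhead%n",
                estimateBytes / (1024.0 * 1024.0)); // prints ~381.5 MB
    }
}

The real footprint will be noticeably higher, since the hash structure holding the values adds per-entry overhead on top of the raw 8 bytes per value.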