I am evaluating distributed query engines for analytical queries (both interactive and batch) on large-scale data (~100GB). One of the requirements is low latency (<= 1s) for count-distinct queries, where approximate results (with up to 5% error) are acceptable.
Presto seems to support this with its approx_distinct() function. As far as I understand, it uses HyperLogLog for that. However, unless the data is persisted in rolled-up form together with the HyperLogLog sketches, the sketches would have to be computed on the fly by scanning the raw rows at query time. I do not think such queries would finish within a second on large datasets.
Does Presto support rollup with HyperLogLog computation at ingestion time (similar to Druid)? Given that, unlike Druid, Presto queries data from external stores (Hive, Cassandra, RDBMS, etc.), I am not sure that ingestion-time rollups are supported unless Presto has a native store that supports them. Can someone please confirm?
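
To make concrete what I mean by an ingestion-time rollup, here is a minimal Python sketch. It is not Presto or Druid code: the `HyperLogLog` class, `build_rollup`, `merge_all`, the precision `p=14`, and the event data are all hypothetical and only illustrate the idea. One sketch is built per day while the data is ingested, and a count-distinct query then merges the small stored sketches instead of scanning the raw rows.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value (first 8 bytes of SHA-1).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bits = 64 - self.p
        idx = h >> bits                      # top p bits pick the register
        rest = h & ((1 << bits) - 1)         # remaining bits
        rank = bits - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches is a register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)     # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)   # small-range (linear counting) correction
        return raw


def build_rollup(events, p=14):
    """Ingestion time: keep one sketch per day instead of the raw user ids."""
    rollup = {}
    for day, user_id in events:
        rollup.setdefault(day, HyperLogLog(p)).add(user_id)
    return rollup


def merge_all(sketches, p=14):
    """Query time: combine the stored sketches without touching raw rows."""
    result = HyperLogLog(p)
    for sketch in sketches:
        result = result.merge(sketch)
    return result


# Hypothetical data: two days with overlapping users (100,000 distinct in total).
events = [("2024-01-01", u) for u in range(60000)]
events += [("2024-01-02", u) for u in range(40000, 100000)]

rollup = build_rollup(events)
total = merge_all(rollup.values())
print(round(total.estimate()))
```

The point is that the query-time cost depends on the number of stored sketches (here two), not on the number of raw rows, which is what would make sub-second latency plausible.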