
I am evaluating distributed query engines for analytical queries (both interactive and batch) on large-scale data (~100 GB). One of the requirements is low latency (<= 1 s) for count-distinct queries, where approximate results (with up to 5% error) are acceptable.

Presto seems to support this with its approx_distinct(). As far as my understanding goes, it uses HyperLogLog for that. However, unless the data is persisted in rolled-up form, along with the HyperLogLog values, it would have to be computed on the fly. I do not think my queries would finish within a second for large datasets.
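For reference, Presto's `approx_distinct` accepts an optional second argument giving the maximum standard error, so the 5% tolerance can be stated directly in the query (table and column names below are illustrative):

```sql
-- Approximate distinct count with a 5% maximum standard error
SELECT approx_distinct(user_id, 0.05) AS approx_users
FROM events
WHERE event_date >= DATE '2017-08-01';
```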

Does it support rollup with HyperLogLog computation at ingestion time (similar to Druid)? Given that unlike Druid, Presto queries the data from external stores (Hive/Cassandra/RDBMS etc.), I am not sure that ingestion time rollups are supported, unless Presto's native store supports them. Can someone please confirm?

Ameya

2 Answers


There is no such thing as "Presto's native store". Presto is a query execution engine with a connector architecture that allows plugging in multiple storage layers.

If you want an approximate count-distinct for a whole data set, you can compute table stats (when using Presto with Hive, this currently needs to be done in Hive).
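As a sketch of that workflow (the table name `events` is illustrative), you would gather the statistics in Hive and then read them back from Presto:

```sql
-- In Hive: compute table- and column-level statistics
ANALYZE TABLE events COMPUTE STATISTICS;
ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS;
```

Presto can then expose the collected statistics, including approximate distinct-value counts per column, via `SHOW STATS FOR events`.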

If you want an approximate count-distinct for a dynamic selection of data, you still need to read the data, so you won't get sub-second latency on a data set that large. However, you can combine approx_distinct (or plain count(DISTINCT ...)) with TABLESAMPLE to limit the amount of data read.
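A minimal sketch of that combination (table, column, and filter are illustrative):

```sql
-- Read roughly 1% of the rows, then approximate the distinct count
SELECT approx_distinct(user_id) AS approx_users
FROM events TABLESAMPLE BERNOULLI (1)
WHERE country = 'US';
```

Note that distinct counts do not scale linearly under sampling: a distinct count over a 1% sample is a lower bound, not 1% of the true distinct count, so this trades additional (and harder to bound) error for speed.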

Piotr Findeisen
  • Thanks @piotr-findeisen. The queries will be approximate count-distinct queries on a subset of dimensions, with filters on one or more of the dimensions. Regarding Presto's native store, I was referring to the passing mention in [this link](https://news.ycombinator.com/item?id=6684678). Was wondering if it has been productionized by now. Thanks for the pointer to TABLESAMPLE. Will explore it. – Ameya Aug 16 '17 at 03:36
  • Perhaps this is about Raptor (connector and data storage in one), but I won't elaborate on that. – Piotr Findeisen Aug 16 '17 at 07:18

You can try Verdict, which can significantly reduce query processing cost by applying statistics and approximate query processing, yielding 99.9% accuracy. It runs on all SQL-based engines, including Apache Hive, Apache Impala, Apache Spark, Amazon Redshift, etc.

You can download the source code from here. After downloading it and some simple setup, you can issue queries as you normally would and get results in much less time.

J. Doe