4

Presto has multi connectors. While the connectors do implement read and write operations, from all the tutorials I read, it seems they are typically used as data sources to read from only. For example, netflix has "10 petabyte" of data on Amazon S3 and they explicitly state that no disk (and no HDFS) is used on the Presto worker nodes. The stated use case are "ad hoc interactive" queries.

Also, Amazon Athena is essentially S3+Presto and comes with similar use cases.

I'm puzzled how this can work in practice. Obviously, you don't want to read 10 PB of data on every query. So I assume, you want to keep some previously fetched data in memory, such as a database index. However, with no constraints on the data and the queries, I fail to understand how this can be efficient.

Use case 1: I run the same query frequently, e.g. to show metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

Use case 2: I'm analysing a large data set. Each query is slightly different, however there are common subqueries or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

Or, if this is not the case, would I be well advised to store intermediate results somewhere (e.g. CREATE TABLE AS ...)?

Guru Stron
  • 102,774
  • 10
  • 95
  • 132
Jan
  • 723
  • 1
  • 11
  • 24

2 Answers2

5

There is no data caching tier for Presto itself. To be honest, I don't think the features you are proposing here are supposed to be provided by Presto as a SQL analytics engine. For both of use cases you mentioned, I suggest deploying Alluxio together with Presto as a caching layer to help:

Use case 1: I run the same query frequently, e.g. to show metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

As a caching layer, Alluxio can detect the data access pattern from Presto (or other applications) and make caching/eviction decisions to serve the most frequently used data in a memory tier (up to your configuration, can be SSD or HDD too). This will help when the data access is not consistent.

Use case 2: I'm analysing a large data set. Each query is slightly different, however there are common subqueries or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

With more knowledge on your input data, you can enforce data policies in Alluxio to (1) preload data (common subqueries) into the caching space, (2) set TTL to retire data from Alluxio caching space to make room for other hot data, (3) set caching policies on certain input path (e.g., CACHE on certain paths, NO CACHE on some other paths).

Checkout more tips to run Presto/Alluxio stack: https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1/

apc999
  • 250
  • 3
  • 6
3

As far as I know there is no intermediate implicit caching layer. When you use HDFS on your cluster, you surely benefit from OS disk caches, so next query run will be faster, but you won't get instant cached results. Similar data block-level caching might apply to S3 too.

Generally, no reasonably-sized system can sift through 10 petabytes of data, since reading all that data would take a lot of time. However, data can be partitioned so that Presto knows more or less which pieces of data need to be scanned. When partitioning aligns with query conditions (e.g. you partition data by data and you query for most recent data), this can work really well.

When your data is not partitioned the same way you query, and you don't want to re-partition it differently, saving temporary results with create table ... as select makes much sense. You can also store such temporary tables using some in-memory storage, e.g. raptor (currently undocumented) or memory connectors for even faster access.

For some starter tips about partitioning, tuning storage and queries you can have a look at https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/.

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
  • Thank you, the link you provided helps: If the partitioning of the data is part of the data source configuration and the queries are aligned with this partitioning, then, of course, Athena/Presto can answer the request while reading only a subset of the data. – Jan Apr 04 '17 at 10:52