what are the downsides of using Presto in ETL scenarios?

Question

I have read that Presto is meant for ad-hoc querying and Hive/spark are more for apt for ETL scenarios. It seems the reason not to use Presto in ETL is because Presto queries can fail and there is no mid-query fault tolerance.

However, it looks like we can get around it by using Presto in the our daily Jenkins workflow along with retry in case of query failures. Has anybody tried using this way or are their any downsides to this approach?

If you are using Presto in ETL, how big were your Presto clusters ? What kind of EC2 instances you use for your presto cluster?

score 5 · Answer 1 · answered Aug 02 '18 at 16:03

Provided your ETL jobs are not very long or complex (i.e., standard SQL is sufficient for the transformations needed), I think Presto can do a reasonable job. As you pointed out, there is no mid-query fault tolerance so you need a mechanism to restart your queries upon failure. Hopefully, the speed of Presto will offset an occasional restart. One additional strategy is to break long complex queries into a series of shorter/simpler ones and create temporary tables in between to effectively achieve manual checkpointing. Such strategy was leveraged by Facebook when they migrated some of their batch Hive jobs to Presto: https://www.slideshare.net/kbajda/presto-at-hadoop-summit-2016

One additional recommendation I would make is to spin a separate Presto cluster for ETL to avoid any resource contention with your interactive Presto workload.

In terms of instance types, it obviously depends on your queries. Most often you want a good balance of RAM and CPU. Starting with R4 instance type is a good bet. Once you observe your workload at runtime, you can either add more nodes to speed up the ETL process or explore other instance types (e.g., if CPU is fully loaded, moving to C4/5 instance type might be a good bet).

More generally the Presto-Users mailing list is a good source of information: https://groups.google.com/group/presto-users. Also, learning from the community members at events such as Presto Summit (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/).

what are the downsides of using Presto in ETL scenarios?

1 Answers1