
I'm trying to use Presto with an Amazon S3 bucket, but haven't found much related information on the Internet.

I've installed Presto on a micro instance, but I'm not able to figure out how to connect to S3. There is a bucket with files in it. I have a running Hive metastore server, and I have configured it in Presto's hive.properties. But when I try to use the LOCATION clause in Hive, it doesn't work.

It throws an error saying it cannot find the file scheme type s3.

Also, I don't understand why we need to run Hadoop, yet without Hadoop, Hive doesn't run. Is there an explanation for this?

This and this are the documentation pages I followed while setting up.

Codex
  • A bit away from your question, but why are you not using AWS EMR? All these configurations are there out of the box, and as far as I know, Presto needs a cluster to perform well; a single EC2 instance is not enough. One more note: if you don't want to launch a cluster to run Presto, you can use AWS Athena, a service provided by Amazon that offers Presto as a service. Athena pricing is per data scanned, so if your data is small the cost is low; they charge $5 per 1 TB scanned. I strongly suggest it if you are just experimenting with Presto – Abdulhafeth Sartawi Jul 05 '18 at 09:26

1 Answer


Presto uses the Hive metastore to map database tables to their underlying files. These files can live on S3 and can be stored in a number of formats: CSV, ORC, Parquet, SequenceFile, etc.

The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
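For example, an external table backed by CSV files in S3 might be declared in Hive like this (the bucket, paths, and column names here are hypothetical):

```sql
-- Hypothetical example: map CSV files under s3://my-bucket/logs/ to a Hive table.
CREATE EXTERNAL TABLE web_logs (
  request_time STRING,
  url          STRING,
  status_code  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/logs/';
```

Because the table is EXTERNAL, dropping it removes only the metastore entry; the files in S3 are left untouched.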

In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the thrift service of an appropriate Hive metastore service.
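A minimal hive.properties along those lines might look like the following sketch (the metastore hostname is a placeholder; 9083 is the conventional thrift port, and the S3 keys are only needed for S3-backed tables):

```properties
# etc/catalog/hive.properties (hypothetical values)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host.example.com:9083

# Credentials for reading S3-backed tables (placeholders, not real keys)
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```

After editing the catalog file, restart the Presto server so the new configuration is picked up.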

The Amazon EMR cluster instances will automatically configure this for you if you select Hive and Presto, so it's a good place to start.

If you want to test this on a standalone EC2 instance, I'd suggest you first focus on getting a functional Hive service working with the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive but does require a functioning Hive set-up, and Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.
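A local table of that kind is just a managed (non-external) Hive table; a hypothetical sketch:

```sql
-- Hypothetical example: a managed table whose data lives on HDFS.
CREATE TABLE local_logs (
  request_time STRING,
  status_code  INT
)
STORED AS ORC;
-- Hive places the files under its warehouse directory on HDFS,
-- typically /user/hive/warehouse/local_logs/ by default.
```

Once this works from the Hive CLI, the same table should be visible to Presto through the hive catalog without any further setup.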

Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but it's just a JDBC pass-through, so I don't think you'll gain much.
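For reference, such a connector is also just a catalog properties file; a sketch for MySQL, with placeholder host and credentials:

```properties
# etc/catalog/mysql.properties (hypothetical values)
connector.name=mysql
connection-url=jdbc:mysql://db-host.example.com:3306
connection-user=presto
connection-password=secret
```

Each file under the catalog directory becomes a separately addressable catalog in Presto queries.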

Euan
  • Thanks for your reply. I want to try Presto with absolutely no cost associated. Please check the edits; I had more doubts after going through the relevant material you just mentioned. – Codex May 13 '16 at 09:32
  • With Amazon EMR I would incur costs, so I'm trying to avoid that. Is there any way around this? – Codex May 13 '16 at 09:39
  • 2
    I wrote the following [post](http://blog.danielcorin.com/code/2016/04/11/querying-s3-with-presto.html) last year on the topic. I haven't tried the setup since but there a chance it could help. The general idea is to use a Docker container as the Hive metastore so you don't need a managed service like EMR just for the purposes of routing your Presto queries. – Daniel Corin Feb 22 '17 at 06:40
  • 1
    @Euan is any of that information (like how EMR puts config in a special folder that you can't find in presto docs) available anywhere, or is it all tribal knowledge at this point? Trying to get a basic Presto -> Hive -> S3 setup working and it's surprisingly awful. – a p May 26 '17 at 18:01