Presto uses the Hive metastore to map database tables to their underlying files. These files can live on S3 and can be stored in a number of formats: CSV, ORC, Parquet, SequenceFile, and so on.
The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
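For example, a table backed by CSV files on S3 might be declared roughly like this in Hive (the table name, columns, and bucket path are just placeholders):

    -- registers existing files under s3://my-bucket/page_views/ without moving them
    CREATE EXTERNAL TABLE page_views (
      user_id    BIGINT,
      url        STRING,
      view_time  TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/page_views/';

Once a definition like that exists in the metastore, Presto can query the files in place.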
In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the Thrift service of an appropriate Hive metastore service.
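As a rough sketch, that catalog file ends up looking something like this (the metastore host below is a placeholder; 9083 is the usual Thrift port, and hive-hadoop2 is the standard Hive connector name):

    # /etc/presto/conf.dist/catalog/hive.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore-host.example.com:9083

After changing it, restart Presto so the catalog definition is re-read.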
An Amazon EMR cluster will configure this for you automatically if you select both Hive and Presto when launching it, so it's a good place to start.
If you want to test this on a standalone EC2 instance, I'd suggest you first focus on getting a functional Hive service working on top of the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive but does require a functioning Hive setup; Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.
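Once the metastore and catalog are wired up, a quick sanity check from the Presto CLI looks something like this (the coordinator address and table name are just examples):

    # query the hive catalog via the Presto coordinator (assumed to be local on port 8080)
    presto-cli --server localhost:8080 --catalog hive --schema default --execute 'SHOW TABLES;'
    presto-cli --server localhost:8080 --catalog hive --schema default --execute 'SELECT count(*) FROM page_views;'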
Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but these are just JDBC pass-throughs, so I don't think you'll gain much.
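If you do go that route, the connector is configured with another catalog file along these lines (host and credentials below are placeholders):

    # /etc/presto/conf.dist/catalog/mysql.properties
    connector.name=mysql
    connection-url=jdbc:mysql://mysql-host.example.com:3306
    connection-user=presto
    connection-password=secret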