
I am looking into using Hive on our Hadoop cluster and then using Presto to do some analytics on the data stored in Hadoop, but I am still confused about a few things:

  • Files are stored in Hadoop (some kind of file manager)
  • Hive needs tables to store data from Hadoop (a data manager)
    • Do Hadoop and Hive each store their data separately, or does Hive just use the files already in Hadoop (in terms of hard disk space and so on)? In other words, does Hive import the Hadoop data into its own tables and leave Hadoop alone, or how should I see this?
  • Can Presto be used without Hive, directly on Hadoop?

Thanks in advance for answering my questions :)

– darkownage

1 Answer


First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call a data manager?

Actually Hive can use both: "regular" files in HDFS, or tables, which are once again regular files in HDFS plus additional metadata kept in a special datastore (the Hive metastore; the default HDFS directory where table data lives is called the warehouse).
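As a concrete illustration (a minimal sketch; the table name and the `/data/raw/events.csv` path are hypothetical), you can see that a Hive table is just files plus metadata:

```sql
-- Create a managed Hive table; its data lives as plain files under the
-- warehouse directory in HDFS (by default /user/hive/warehouse/events).
CREATE TABLE events (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA (without LOCAL) moves the existing HDFS file into the table's
-- warehouse directory -- the file is not copied or reformatted, so no extra
-- disk space is used; Hive just records metadata about it.
LOAD DATA INPATH '/data/raw/events.csv' INTO TABLE events;
```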

Concerning Presto: it has built-in support for the Hive metastore, so it reads table metadata from the metastore and the data files directly from HDFS without running Hive itself, but you can also write your own connector plugin for any data source.
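For example (a sketch; the metastore hostname and port are placeholders), pointing Presto at an existing Hive metastore only takes a catalog properties file:

```
# etc/catalog/hive.properties -- hostname/port are placeholders
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
```

With that in place, Presto queries the Hive tables directly, no Hive query engine involved.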

You can read more about Hive connector configuration here and about connector plugins here.

– Viacheslav Rodionov
  • I just want to be able to query the data in Hadoop, so I guess I need Hive tables that contain metadata about the files in Hadoop? – darkownage Jan 24 '14 at 10:02
  • @darkownage I think you need **EXTERNAL TABLES**. "The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated." `CREATE EXTERNAL TABLE table1(id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' STORED AS TEXTFILE LOCATION '';` [source](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables) – Viacheslav Rodionov Jan 24 '14 at 12:05