Questions tagged [hive]

Apache Hive is a database built on top of Hadoop and facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible distributed file system. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. Please DO NOT use this tag for flutter database which is also named Hive, use flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets data stored in Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language which is based on SQL and some additional features such as DISTRIBUTE BY, TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers the ability to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real-time queries. In contrast to the systems such as Oracle where analysis is run on a significantly smaller amount of data, but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes, Hive queries response times for even the smallest jobs can be of the order of several minutes. However for larger jobs (e.g., jobs processing terabytes of data) in general they may run into hours and days. Many optimizations and improvements were made to spped-up processing such as fetch-only task, LLAP, materialized views, etc

To summarize, while low latency performance is not the top-priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • Rather reach query kanguage with native suport for JSON, XML, regexp, possibility to call java methods, using python and shell transformations, analytics and windowing functions, possibility to connect to different RDBMS using JDBC drivers, Kafka connector.
  • Ability to read and write almost any file formats using native and third-party SerDe, RegexSerDe.
  • Numerous third-party extensions, for example brickhouse UDFs, etc

How to write good Hive question:

  1. Add clear textual problem description.
  2. Provide query and/or table DDL if applicable
  3. Provide exception message
  4. Provide input and desired output data example
  5. Questions about query performance should include EXPLAIN query output.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output and exception messages.
  7. Use proper code and text formatting

Official links:

Useful Links:

21846 questions
16
votes
1 answer

How to get date difference in minutes using Hive

Below query is my sql server query and I want it to convert it into hive query: select DATEDIFF([minute], '19000101', '2013-01-01 10:10:10')
displayname
  • 317
  • 1
  • 5
  • 15
16
votes
1 answer

create a schema in hive

How can I create multiple schema under a user defined DB in hive? I am trying to create a schema under a DB sam_db in hive. I am using the folowing code. use sam_db; create schema sam_db_schema; But when I visit the local host to see the filesystem…
Sam
  • 2,545
  • 8
  • 38
  • 59
16
votes
3 answers

Partition columns when inserting into a Hive table from a select

I was researching partitions in Hive and came upon: http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ In this link, the author says: “When inserting data into a partition, it’s necessary to include the partition columns as…
KS1234
  • 161
  • 1
  • 1
  • 4
16
votes
7 answers

How to check if a table exists in Hive?

I am connecting to Hive via an ODBC driver from a .NET application. Is there a query to determine if a table already exists? For example, in MSSQL you can query the INFORMATION_SCHEMA table and in Netezza you can query the _v_table table. Any…
Michael Robinson
  • 1,985
  • 2
  • 21
  • 31
16
votes
5 answers

Insert data into hive table

Using a Cygwin distribution, I've installed Hadoop 0.20.3 and Hive 0.11.0. First of all, I don't understand how to use the Hive CLI: hive> show tables; Then enter and nothing happens. I can execute queries using hive -e/-f. Then, I've created a…
Apaachee
  • 900
  • 2
  • 10
  • 32
16
votes
2 answers

Comments not working in hive CLI

I have seen other posts saying that double hyphen -- should be a hive comment. But at least within the hive CLI that is not working properly; hive> -- some comment; FAILED: Parse Error: line 0:-1 cannot recognize input near '' ''…
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
16
votes
3 answers

HiveQL UNION ALL

I have table_A: id var1 var2 1 a b 2 c d Table_B: id var1 var2 3 e f 4 g h All I want is table, combined: id var1 var2 1 a b 2 c d 3 e f 4 g h This is my .hql: CREATE TABLE combined AS SELECT all.id,…
dum_dum_dummy
  • 161
  • 1
  • 1
  • 3
16
votes
5 answers

Hive loading in partitioned table

I have a log file in HDFS, values are delimited by comma. For example: 2012-10-11 12:00,opened_browser,userid111,deviceid222 Now I want to load this file to Hive table which has columns "timestamp","action" and partitioned by "userid","deviceid".…
Valery Yesypenko
  • 412
  • 3
  • 6
  • 14
16
votes
1 answer

Hive enforces schema during read time?

What is the difference and meaning of these two statements that I encountered during a lecture here: 1. Traditional databases enforce schema during load time. and 2. Hive enforces schema during read time.
London guy
  • 27,522
  • 44
  • 121
  • 179
16
votes
13 answers

SQL moving average

How do you create a moving average in SQL? Current table: Date Clicks 2012-05-01 2,230 2012-05-02 3,150 2012-05-03 5,520 2012-05-04 1,330 2012-05-05 2,260 2012-05-06 3,540 2012-05-07 …
Don P
  • 60,113
  • 114
  • 300
  • 432
15
votes
6 answers

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query: describe formatted I…
code
  • 5,294
  • 16
  • 62
  • 113
15
votes
3 answers

How to optimize partitioning when migrating data from JDBC source?

I am trying to move data from a table in PostgreSQL table to a Hive table on HDFS. To do that, I came up with the following code: val conf = new…
Metadata
  • 2,127
  • 9
  • 56
  • 127
15
votes
4 answers

Spark 2: how does it work when SparkSession enableHiveSupport() is invoked

My question is rather simple, but somehow I cannot find a clear answer by reading the documentation. I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore. I create a session in my Spark program as follows: SparkSession…
Anthony Arrascue
  • 220
  • 1
  • 2
  • 13
15
votes
2 answers

External Hive Table Refresh table vs MSCK Repair

I have external hive table stored as Parquet, partitioned on a column say as_of_dt and data gets inserted via spark streaming. Now Every day new partition get added. I am doing msck repair table so that the hive metastore gets the newly added…
Ajith Kannan
  • 812
  • 1
  • 8
  • 30
15
votes
3 answers

Hive How to select all but one column?

Suppose my table looks something like: Col1 Col2 Col3.....Col20 Col21 Now I want to select all but Col21. I want to change it to unix_timestamp() before I insert into some other table. So the trivial approach is to do something like: INSERT INTO…
Rocking chief
  • 1,039
  • 3
  • 17
  • 31