Questions tagged [hive]

Apache Hive is a database built on top of Hadoop and facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible distributed file system. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. Please DO NOT use this tag for flutter database which is also named Hive, use flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

Tools to enable easy data summarization (ETL)
Ad-hoc querying and analysis of large datasets data stored in Hadoop file system (HDFS)
A mechanism to put structure on this data
An advanced query language called Hive Query Language which is based on SQL and some additional features such as DISTRIBUTE BY, TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers the ability to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real-time queries. In contrast to the systems such as Oracle where analysis is run on a significantly smaller amount of data, but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes, Hive queries response times for even the smallest jobs can be of the order of several minutes. However for larger jobs (e.g., jobs processing terabytes of data) in general they may run into hours and days. Many optimizations and improvements were made to spped-up processing such as fetch-only task, LLAP, materialized views, etc

To summarize, while low latency performance is not the top-priority of Hive's design principles, the following are Hive's key features:

Scalability (scale out with more machines added dynamically to the Hadoop cluster)
Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
Fault-tolerance
Loose-coupling with its input formats
Rather reach query kanguage with native suport for JSON, XML, regexp, possibility to call java methods, using python and shell transformations, analytics and windowing functions, possibility to connect to different RDBMS using JDBC drivers, Kafka connector.
Ability to read and write almost any file formats using native and third-party SerDe, RegexSerDe.
Numerous third-party extensions, for example brickhouse UDFs, etc

How to write good Hive question:

Add clear textual problem description.
Provide query and/or table DDL if applicable
Provide exception message
Provide input and desired output data example
Questions about query performance should include EXPLAIN query output.
Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output and exception messages.
Use proper code and text formatting

Official links:

Useful Links:

21846 questions

votes

1 answer

How to get date difference in minutes using Hive

Below query is my sql server query and I want it to convert it into hive query: select DATEDIFF([minute], '19000101', '2013-01-01 10:10:10')

asked Nov 08 '15 at 10:45

displayname

votes

1 answer

create a schema in hive

How can I create multiple schema under a user defined DB in hive? I am trying to create a schema under a DB sam_db in hive. I am using the folowing code. use sam_db; create schema sam_db_schema; But when I visit the local host to see the filesystem…

hadoop hive hdfs

asked Mar 03 '14 at 10:48

Sam

2,545
8
38
59

votes

3 answers

Partition columns when inserting into a Hive table from a select

I was researching partitions in Hive and came upon: http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ In this link, the author says: “When inserting data into a partition, it’s necessary to include the partition columns as…

hadoop hive

asked Feb 08 '14 at 20:28

KS1234

votes

7 answers

How to check if a table exists in Hive?

I am connecting to Hive via an ODBC driver from a .NET application. Is there a query to determine if a table already exists? For example, in MSSQL you can query the INFORMATION_SCHEMA table and in Netezza you can query the _v_table table. Any…

sql odbc hive

asked Jan 17 '14 at 19:53

Michael Robinson

1,985
2
21
31

votes

5 answers

Insert data into hive table

Using a Cygwin distribution, I've installed Hadoop 0.20.3 and Hive 0.11.0. First of all, I don't understand how to use the Hive CLI: hive> show tables; Then enter and nothing happens. I can execute queries using hive -e/-f. Then, I've created a…

database hadoop hive

asked Jun 07 '13 at 09:32

Apaachee

votes

2 answers

Comments not working in hive CLI

I have seen other posts saying that double hyphen -- should be a hive comment. But at least within the hive CLI that is not working properly; hive> -- some comment; FAILED: Parse Error: line 0:-1 cannot recognize input near '' ''…

comments hive

asked Mar 24 '13 at 04:50

WestCoastProjects

58,982
91
316
560

votes

3 answers

HiveQL UNION ALL

I have table_A: id var1 var2 1 a b 2 c d Table_B: id var1 var2 3 e f 4 g h All I want is table, combined: id var1 var2 1 a b 2 c d 3 e f 4 g h This is my .hql: CREATE TABLE combined AS SELECT all.id,…

hadoop union hive hiveql

asked Dec 31 '12 at 03:02

dum_dum_dummy

votes

5 answers

Hive loading in partitioned table

I have a log file in HDFS, values are delimited by comma. For example: 2012-10-11 12:00,opened_browser,userid111,deviceid222 Now I want to load this file to Hive table which has columns "timestamp","action" and partitioned by "userid","deviceid".…

loading hive

asked Oct 30 '12 at 21:58

Valery Yesypenko

votes

1 answer

Hive enforces schema during read time?

What is the difference and meaning of these two statements that I encountered during a lecture here: 1. Traditional databases enforce schema during load time. and 2. Hive enforces schema during read time.

hadoop mapreduce hive hdfs

asked Aug 01 '12 at 17:13

London guy

27,522
44
121
179

votes

13 answers

SQL moving average

How do you create a moving average in SQL? Current table: Date Clicks 2012-05-01 2,230 2012-05-02 3,150 2012-05-03 5,520 2012-05-04 1,330 2012-05-05 2,260 2012-05-06 3,540 2012-05-07 …

sql hive moving-average

asked May 16 '12 at 18:58

Don P

60,113
114
300
432

votes

6 answers

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query: describe formatted I…

apache-spark hive apache-spark-sql

asked Jan 06 '19 at 10:27

code

5,294
16
62
113

votes

3 answers

How to optimize partitioning when migrating data from JDBC source?

I am trying to move data from a table in PostgreSQL table to a Hive table on HDFS. To do that, I came up with the following code: val conf = new…

apache-spark jdbc hive apache-spark-sql partitioning

asked Oct 02 '18 at 06:38

Metadata

2,127
9
56
127

votes

4 answers

Spark 2: how does it work when SparkSession enableHiveSupport() is invoked

My question is rather simple, but somehow I cannot find a clear answer by reading the documentation. I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore. I create a session in my Spark program as follows: SparkSession…

apache-spark hive apache-spark-sql hiveql

asked Sep 04 '18 at 15:01

Anthony Arrascue

votes

2 answers

External Hive Table Refresh table vs MSCK Repair

I have external hive table stored as Parquet, partitioned on a column say as_of_dt and data gets inserted via spark streaming. Now Every day new partition get added. I am doing msck repair table so that the hive metastore gets the newly added…

apache-spark hive hivecontext hive-partitions

asked Aug 06 '18 at 17:40

Ajith Kannan

votes

3 answers

Hive How to select all but one column?

Suppose my table looks something like: Col1 Col2 Col3.....Col20 Col21 Now I want to select all but Col21. I want to change it to unix_timestamp() before I insert into some other table. So the trivial approach is to do something like: INSERT INTO…

hive hiveql

asked Jul 08 '18 at 00:16

Rocking chief

1,039
3
17
31

Prev 1 2 3

…

99 100 Next