Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

Tools to enable easy data summarization (ETL)
Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
A mechanism to put structure on this data
An advanced query language called Hive Query Language (HiveQL), which is based on SQL with some additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
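As a rough illustration of plugging in a custom script, here is a minimal sketch of a Python program that could be used with the TRANSFORM clause. It assumes the default TRANSFORM row format, in which rows arrive on standard input as tab-separated lines and are written back the same way. The script name `normalize_names.py`, the table `some_table`, and the columns `id` and `name` are hypothetical and chosen only for this example.

```python
#!/usr/bin/env python3
"""Streaming script for Hive's TRANSFORM clause (illustrative sketch).

Hypothetical HiveQL that would invoke it (table and column names are made up):

    ADD FILE normalize_names.py;
    SELECT TRANSFORM (id, name)
           USING 'python3 normalize_names.py'
           AS (id, name)
    FROM some_table;
"""
import sys


def transform_line(line):
    """Trim and upper-case the second tab-separated field of one input row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        fields[1] = fields[1].strip().upper()
    return "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Each input line is one row; columns are separated by tabs.
    # Each output line is read back by Hive as one row in the same form.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Keeping the per-row logic in a separate function such as `transform_line` makes it possible to check the script locally with a few sample lines before running it on a cluster.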

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. By contrast, Hive query response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority among Hive's design principles, the following are Hive's key features:

Scalability (scale out with more machines added dynamically to the Hadoop cluster)
Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
Fault-tolerance
Loose-coupling with its input formats
A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods; Python and shell transformations; analytic and windowing functions; connections to different RDBMSs through JDBC drivers; and a Kafka connector.
Ability to read and write almost any file format using native and third-party SerDes (for example, RegexSerDe).
Numerous third-party extensions, for example Brickhouse UDFs.

How to write a good Hive question:

Add a clear textual description of the problem.
Provide the query and/or table DDL, if applicable.
Provide the exception message.
Provide an example of the input data and the desired output.
Questions about query performance should include the EXPLAIN output for the query.
Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
Use proper code and text formatting.

21846 questions

16 answers

How to delete and update a record in Hive

I have installed Hadoop, Hive, and Hive JDBC, which are running fine for me. But I still have a problem: how do I delete or update a single record using Hive? The delete and update commands from MySQL do not work in Hive. Thanks hive> delete from…

hadoop hive sql-delete

asked Jul 23 '13 at 12:44

Charnjeet Singh

12 answers

Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?

hadoop hive hdfs

asked Feb 20 '11 at 16:43

Yuval

10 answers

I have created a table in hive, I would like to know which directory my table is created in?

I have created a table in hive, I would like to know which directory my table is created in? I would like to know the path...

hive hiveql

asked Nov 01 '12 at 13:33

Muneer Basha Syed

5 answers

Hive: how to show all partitions of a table?

I have a table with 1000+ partitions. The "show partitions" command only lists a small number of partitions. How can I show all partitions? Update: I found that the "show partitions" command lists exactly 500 partitions. "select ... where ..." only…

hadoop hive

asked Mar 25 '13 at 13:34

Kevin Leo

16 answers

Hive insert query like SQL

I am new to Hive and want to know if there is any way to insert data into a Hive table like we do in SQL. I want to insert my data into Hive like INSERT INTO tablename VALUES (value1,value2..) I have read that you can load the data from a file to…

sql hadoop hive hiveql

asked Jul 02 '13 at 12:20

Y0gesh Gupta

6 answers

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with…

java testing hadoop mapreduce hive

asked May 23 '13 at 16:47

yoni

17 answers

How to export a Hive table into a CSV file?

I used this Hive query to export a table into a CSV file: INSERT OVERWRITE DIRECTORY '/user/data/output/test' select column1, column2 from table1; The generated file '000000_0' does not have a comma separator. Is this the right way to generate CSV…

csv hive

asked Jun 13 '13 at 12:04

Dunith Dhanushka

8 answers

Hive Data Retrieval Queries: Difference between CLUSTER BY, ORDER BY, and SORT BY

On Hive, for Data Retrieval Queries (e.g. SELECT ...), NOT Data Definition (e.g. CREATE TABLES ...), as far as I understand: SORT BY only sorts within the reducer. ORDER BY orders things globally but shoves everything into one reducer. CLUSTER BY…

hadoop hql hive

asked Dec 05 '12 at 01:42

cashmere

3 answers

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.) I am trying to do this…

apache-spark hive pyspark apache-spark-sql hiveql

asked Oct 20 '16 at 18:27

user2205916

2 answers

What's the difference between -DskipTests and -Dmaven.test.skip=true

I was trying to build hive-0.13. When using -Dmaven.test.skip=true, it will not build the test jars but it will check test dependency. When using -DskipTests, it will not build the test jars and also not check test dependency. What's the difference…

java maven hive

asked Sep 03 '14 at 08:08

Stanley Shi

8 answers

How to skip CSV header in Hive External Table?

I am using Cloudera's version of Hive and trying to create an external table over a csv file that contains the column names in the first column. Here is the code that I am using to do that. CREATE EXTERNAL TABLE Test ( RecordId int, FirstName…

hive

asked Apr 01 '13 at 21:13

Rick Gittins

18 answers

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I have Hadoop 2.7.1 and apache-hive-1.2.1 installed on Ubuntu 14.0. Why is this error occurring? Is any metastore installation required? When we type the hive command in a terminal, how are the XMLs called internally, what is the flow of those…

apache hadoop hive

asked Feb 17 '16 at 06:19

Arti Nalawade

5 answers

Just get column names from hive table

I know that you can get column names from a table via the following trick in hive: hive> set hive.cli.print.header=true; hive> select * from tablename; Is it also possible to just get the column names from the table? I dislike having to change a…

sql hadoop hive

asked Oct 03 '14 at 14:59

cantdutchthis

5 answers

How does impala provide faster query response compared to hive

I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared to Hive for the queries I have used so far. I am wondering if there are…

hadoop hive impala

asked May 26 '13 at 02:07

techuser soma

3 answers

How to load data to hive from HDFS without removing the source file?

When loading data from HDFS to Hive using the LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename; command, it looks like it is moving the hdfs_file to the hive/warehouse dir. Is it possible (How?) to copy it instead of moving it, in order, for the file, to…

hadoop hive

asked Sep 27 '11 at 10:23

Suge
