Questions tagged [hive]

Apache Hive is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please do NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a data warehouse system built on top of Hadoop that provides the following:

  • Tools to enable easy data extract/transform/load (ETL) and summarization
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language (HiveQL), which is based on SQL with some additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by the language's built-in capabilities.
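As a minimal sketch of such a plug-in, the script below is the kind of custom mapper that the TRANSFORM clause can stream rows through. It assumes Hive's default streaming format (one row per line, tab-separated columns, NULL written as the two characters `\N`). The input columns `user_id` and `url`, the script name `extract_domain.py`, and the table `page_views` are hypothetical names chosen for illustration.

```python
import sys


def transform_line(line):
    """Map one tab-separated input row (user_id, url) to (user_id, domain)."""
    # Assumption: exactly two columns per row, separated by a tab.
    user_id, url = line.rstrip("\n").split("\t")
    if url == r"\N":
        # Hive encodes NULL as backslash-N by default; pass it through.
        domain = r"\N"
    else:
        # Drop an optional scheme, then keep everything before the first "/".
        domain = url.split("//", 1)[-1].split("/", 1)[0].lower()
    return f"{user_id}\t{domain}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive feeds rows on stdin and reads result rows from stdout.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

With the script shipped to the cluster via `ADD FILE extract_domain.py;`, a query along the lines of `SELECT TRANSFORM (user_id, url) USING 'python3 extract_domain.py' AS (user_id, domain) FROM page_views;` would invoke it, assuming `python3` is available on the worker nodes.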

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. In Hive, by contrast, response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority among Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods and to use Python and shell transformations; analytic and windowing functions; connectivity to different RDBMSs through JDBC drivers; and a Kafka connector.
  • Ability to read and write almost any file format using native and third-party SerDes (for example, RegexSerDe).
  • Numerous third-party extensions, for example Brickhouse UDFs.

How to write a good Hive question:

  1. Add a clear textual description of the problem.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide an example of the input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.

21846 questions
78 votes
16 answers

How to delete and update a record in Hive

I have installed Hadoop, Hive, and Hive JDBC, which are running fine for me. But I still have a problem: how do I delete or update a single record in Hive? The DELETE and UPDATE commands from MySQL do not work in Hive. Thanks hive> delete from…
Charnjeet Singh
77 votes
12 answers

Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?
Yuval
77 votes
10 answers

I have created a table in hive, I would like to know which directory my table is created in?

I have created a table in hive, I would like to know which directory my table is created in? I would like to know the path...
Muneer Basha Syed
75 votes
5 answers

Hive: how to show all partitions of a table?

I have a table with 1000+ partitions. The "Show partitions" command only lists a small number of partitions. How can I show all partitions? Update: I found that the "show partitions" command lists exactly 500 partitions. "select ... where ..." only…
Kevin Leo
74 votes
16 answers

Hive insert query like SQL

I am new to Hive and want to know if there is any way to insert data into a Hive table like we do in SQL. I want to insert my data into Hive like INSERT INTO tablename VALUES (value1,value2..) I have read that you can load the data from a file to…
Y0gesh Gupta
71 votes
6 answers

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with…
yoni
70 votes
17 answers

How to export a Hive table into a CSV file?

I used this Hive query to export a table into a CSV file: INSERT OVERWRITE DIRECTORY '/user/data/output/test' select column1, column2 from table1; The generated file '000000_0' does not have a comma separator. Is this the right way to generate CSV…
Dunith Dhanushka
67 votes
8 answers

Hive Data Retrieval Queries: Difference between CLUSTER BY, ORDER BY, and SORT BY

On Hive, for Data Retrieval Queries (e.g. SELECT ...), NOT Data Definition (e.g. CREATE TABLES ...), as far as I understand: SORT BY only sorts within the reducer; ORDER BY orders things globally but shoves everything into one reducer; CLUSTER BY…
cashmere
66 votes
3 answers

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.) I am trying to do this…
user2205916
63 votes
2 answers

What's the difference between -DskipTests and -Dmaven.test.skip=true

I was trying to build hive-0.13. When using -Dmaven.test.skip=true, it will not build the test jars but it will check test dependency. When using -DskipTests, it will not build the test jars and also not check test dependency. What's the difference…
Stanley Shi
62 votes
8 answers

How to skip CSV header in Hive External Table?

I am using Cloudera's version of Hive and trying to create an external table over a csv file that contains the column names in the first column. Here is the code that I am using to do that. CREATE EXTERNAL TABLE Test ( RecordId int, FirstName…
Rick Gittins
61 votes
18 answers

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I have Hadoop 2.7.1 and apache-hive-1.2.1 installed on Ubuntu 14.0. Why is this error occurring? Is any metastore installation required? When we type the hive command in a terminal, how are the XML files called internally, and what is the flow of those…
Arti Nalawade
61 votes
5 answers

Just get column names from hive table

I know that you can get column names from a table via the following trick in hive: hive> set hive.cli.print.header=true; hive> select * from tablename; Is it also possible to just get the column names from the table? I dislike having to change a…
cantdutchthis
58 votes
5 answers

How does impala provide faster query response compared to hive

I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared to Hive for the queries I have used so far. I am wondering if there are…
techuser soma
57 votes
3 answers

How to load data to hive from HDFS without removing the source file?

When loading data from HDFS into Hive using the LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename; command, it looks like it moves hdfs_file to the hive/warehouse dir. Is it possible (how?) to copy it instead of moving it, in order for the file to…
Suge