Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language (HiveQL), which is based on SQL with some additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also lets traditional map/reduce programmers plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
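A minimal sketch of such a custom mapper, assuming it is plugged in through the TRANSFORM clause: by default Hive feeds rows to the script on stdin as tab-separated strings and writes NULL as the literal \N. The script name normalize_domain.py, the users table, and its columns are hypothetical and used only for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical streaming script for Hive's TRANSFORM clause.

Assumed invocation from HiveQL (script, table and column names are made up):

    ADD FILE normalize_domain.py;
    SELECT TRANSFORM (user_id, email)
           USING 'python3 normalize_domain.py'
           AS (user_id, domain)
    FROM users;
"""
import sys

HIVE_NULL = "\\N"  # literal \N: Hive's default text representation of NULL


def transform_lines(lines):
    """Yield 'user_id<TAB>domain' for each 'user_id<TAB>email' input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the whole job
        user_id, email = fields
        if email == HIVE_NULL or "@" not in email:
            domain = HIVE_NULL  # pass NULL through for missing or invalid emails
        else:
            domain = email.rsplit("@", 1)[1].lower()
        yield f"{user_id}\t{domain}"


if __name__ == "__main__":
    # Hive pipes the selected columns in on stdin and reads rows back from stdout.
    for out in transform_lines(sys.stdin):
        print(out)
```

Keeping the row logic in a plain function makes it possible to check the script locally on a few sample lines before submitting a Hive job that uses it.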

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis runs on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. In contrast, Hive query response times can be on the order of several minutes even for the smallest jobs, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority among Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods and use Python and shell transformations; analytic and windowing functions; connections to different RDBMSs through JDBC drivers; and a Kafka connector.
  • Ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe.
  • Numerous third-party extensions, for example Brickhouse UDFs.

How to write a good Hive question:

  1. Add a clear textual description of the problem.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide examples of the input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.

21846 questions
44
votes
4 answers

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which includes ingesting data into the Stage, Raw, and Application layers in HDFS and doing CDC (change data capture). This is currently written in Hive queries and is…
PPPP
  • 561
  • 1
  • 4
  • 14
44
votes
7 answers

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application working to read from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method df.saveAsTable(tablename,mode). The above code works fine, but I have so much data for each…
Chetandalal
  • 674
  • 1
  • 7
  • 18
43
votes
9 answers

COLLECT_SET() in Hive, keep duplicates?

Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with…
batman
  • 1,447
  • 5
  • 16
  • 27
40
votes
6 answers

Hive cast string to date dd-MM-yyyy

How can I cast a string in the format 'dd-MM-yyyy' to a date type also in the format 'dd-MM-yyyy' in Hive? Something along the lines of: CAST('12-03-2010' as date 'dd-mm-yyyy')
pele88
  • 802
  • 2
  • 8
  • 16
39
votes
7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some JSON data (nested) and run queries on it? Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance but I don't know what to type in the hive console to get…
nickponline
  • 25,354
  • 32
  • 99
  • 167
38
votes
1 answer

what is HiveServer and Thrift server

I just started learning Hive. There are three terms I often see in Hive books and tutorials: Hive Server, Hive Service, and Thrift Server. What are they? How are they related? What is the difference? When is each of them used? Please…
37
votes
4 answers

Hive Alter table change Column Name

I am trying to rename a column in Hive. Is there a way to rename columns in Hive, e.g. from tableA(column1, _c1, _c2) to tableA(column1, column2, column3)?
user2978621
  • 803
  • 2
  • 11
  • 20
37
votes
5 answers

Loading Data from a .txt file to Table Stored as ORC in Hive

I have a data file which is in .txt format. I am using the file to load data into Hive tables. When I load the file in a table like CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE; the data is loaded correctly…
Neels
  • 2,547
  • 6
  • 33
  • 40
37
votes
3 answers

Is there a way to make a multi line comment in hive scripts

I know that we can make a single line comment with '--' in hiveQL(hive.sql scripts) but is there a way to make multi line comments? I need something like below /* This sentence is a comment */
Karthick
  • 2,844
  • 4
  • 34
  • 55
36
votes
6 answers

select rows in sql with latest date for each ID repeated multiple times

I have a table where each ID is repeated 3 times. There is a date next to each ID in each row. I want to select the entire row for each ID where the date is latest. There are 370 columns in total in this table, and I want all columns to be selected when I…
Earthshaker
  • 549
  • 1
  • 7
  • 12
35
votes
5 answers

Query HIVE table in pyspark

I am using CDH5.5 I have a table created in HIVE default database and able to query it from the HIVE command. Output hive> use default; OK Time taken: 0.582 seconds hive> show tables; OK bank Time taken: 0.341 seconds, Fetched: 1 row(s) hive>…
Chn
  • 369
  • 1
  • 4
  • 6
35
votes
2 answers

Querying on multiple Hive stores using Apache Spark

I have a spark application which will successfully connect to hive and query on hive tables using spark engine. To build this, I just added hive-site.xml to classpath of the application and spark will read the hive-site.xml to connect to its…
karthik manchala
  • 13,492
  • 1
  • 31
  • 55
35
votes
18 answers

java.lang.RuntimeException:Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

I have configured my Hive as given on link: http://www.youtube.com/watch?v=Dqo1ahdBK_A, but I am getting the following error while creating a table in Hive. I am using hadoop-1.2.1 and hive-0.12.0. hive> create table employee(emp_id int,name…
Raju Sharma
  • 2,496
  • 3
  • 23
  • 41
35
votes
3 answers

Skip first line of csv while loading in hive table

Hello Friends, I created table in hive with help of following command - CREATE TABLE db.test ( fname STRING, lname STRING, age STRING, mob BIGINT ) row format delimited fields terminated BY '\t' stored AS textfile;…
Pankaj
  • 369
  • 1
  • 4
  • 7
35
votes
3 answers

Can I change a table from internal to external in hive?

I created a table in hive as a managed table, but it was supposed to be external, is it possible to change the table type of the table without losing the data?
George TeVelde
  • 1,561
  • 2
  • 12
  • 13