Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language, Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
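
For illustration, a minimal TRANSFORM sketch (the clicks table, its columns, and the session_stats.py script are all hypothetical):

    ADD FILE session_stats.py;              -- ship the script to the cluster

    FROM (
      SELECT sessionid, url
      FROM clicks
      DISTRIBUTE BY sessionid               -- route each session to one reducer
      SORT BY sessionid
    ) c
    SELECT TRANSFORM (c.sessionid, c.url)
      USING 'python session_stats.py'       -- reads tab-separated rows on stdin
      AS (sessionid STRING, pageviews INT); -- tab-separated stdout becomes columns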

Since Hive is Hadoop-based, it does not and cannot promise low query latencies. The paradigm here is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis runs on a significantly smaller amount of data but proceeds much more iteratively with response times between iterations of less than a few minutes, Hive query response times are on the order of several minutes even for the smallest jobs, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as the fetch-only task, LLAP, and materialized views.
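
As one example of these improvements, recent Hive releases (3.0+) support materialized views, which the optimizer can use to rewrite matching queries; a minimal sketch with a hypothetical sales table:

    CREATE MATERIALIZED VIEW mv_daily_sales AS
    SELECT sale_date, SUM(amount) AS total_amount
    FROM sales
    GROUP BY sale_date;
    -- queries that aggregate sales by sale_date can now be answered
    -- from the precomputed view instead of rescanning the base table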

To summarize, while low-latency performance is not the top priority of Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • Rather rich query language with native support for JSON, XML, and regular expressions, the ability to call Java methods, Python and shell transformations, analytic and windowing functions, connectivity to different RDBMSs through JDBC drivers, and a Kafka connector (a few of these are illustrated after this list).
  • Ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe.
  • Numerous third-party extensions, for example the Brickhouse UDFs.
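
To illustrate a few of these language features in a single query (the raw_logs table and its columns are hypothetical):

    SELECT get_json_object(payload, '$.user.id')              AS user_id,  -- native JSON path
           regexp_extract(request, '^GET /([^ ]*)', 1)        AS path,     -- regexp support
           row_number() OVER (PARTITION BY host ORDER BY ts)  AS hit_seq   -- windowing
    FROM raw_logs;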

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output.
  5. Questions about query performance should include EXPLAIN output (see the sketch after this list).
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
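
Getting EXPLAIN output as text is just a matter of prefixing the query; the employees table here is hypothetical:

    EXPLAIN
    SELECT dept, COUNT(*) FROM employees GROUP BY dept;
    -- EXPLAIN EXTENDED adds more detail, such as paths and statistics.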


21846 questions
4 votes, 3 answers

Cannot connect to Hive database using Apache Spark 2.1.0

I am using Spark 2.1.0 and trying to establish a connection with Hive tables. My Hive data warehouse is in /user/hive/warehouse in HDFS; by listing the contents of that folder I can see all the dbname.db folders in it. After some research I…
Justin
4 votes, 4 answers

When should we go for partitioning and bucketing in Hive?

I understand the concepts of partitioning and bucketing in Hive tables. But what I'd like to know is "when do we go for partitioning and when do we go for bucketing?" What are the ideal scenarios that can be considered suitable for partitioning and for bucketing?
Metadata
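
As a rough rule of thumb: partition on low-cardinality columns that queries filter on, bucket on high-cardinality columns used for joins or sampling. A minimal sketch with a hypothetical events table:

    -- Partitioning: one directory per distinct event_date value;
    -- queries filtering on event_date prune whole directories.
    CREATE TABLE events_partitioned (id BIGINT, payload STRING)
    PARTITIONED BY (event_date STRING);

    -- Bucketing: rows hashed on id into a fixed number of files;
    -- helps joins and sampling on id without exploding directories.
    CREATE TABLE events_bucketed (id BIGINT, payload STRING)
    CLUSTERED BY (id) INTO 32 BUCKETS;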
4 votes, 2 answers

I altered a table with HiveQL; then spark-sql couldn't show the table: ERROR: Path does not exist

I altered a table with HiveQL: "ALTER TABLE new_law_area_2 RENAME TO law_area". Then I tried to show my table with spark-sql: "SELECT * FROM law_area LIMIT 10". But it didn't work... with this…
hyeon
4 votes, 3 answers

Hive Query Execution Plan

Here is my Hive query: Insert into schemaB.employee partition(year) select * from schemaA.employee; Below is the query execution plan produced by this query: hive> explain ; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on…
outlier229
4 votes, 1 answer

Using sparkDF.write.saveAsTable() inside the loop causes exponential increase in delay between the jobs

I need to execute a set of different Hive queries inside a for loop:

    hc = HiveContext(sc)
    queryList = [set of queries]
    for i in range(0, X):
        hc.sql(queryList[i])
        sparkDF.write.saveAsTable('hiveTable', mode='append')

Though this code works like a…
Mike
4 votes, 1 answer

sort_array order by a different column, Hive

I have two columns, one of products, and one of the dates they were bought. I am able to order the dates by applying the sort_array(dates) function, but I want to be able to sort_array(products) by the purchase date. Is there a way to do that in…
IDK
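
A common approach here (a sketch, assuming hypothetical columns user_id, purchase_date, and product in a purchases table) is to collect structs, since sort_array orders structs by their first field:

    SELECT user_id,
           sort_array(collect_list(struct(purchase_date, product))) AS by_date
    FROM purchases
    GROUP BY user_id;
    -- by_date is ordered by purchase_date; projecting by_date.product
    -- then gives the products in purchase-date order.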
4 votes, 1 answer

How to unnest array with keys to join on afterwards?

I have two tables, namely table1 and table2. table1 is big, whereas table2 is small. Also, I have a UDF function whose interface is defined as below: --table1-- id 1 2 3 --table2-- category a b c d e f g UDF: foo(id: Int): List[String] I intend…
KAs
4 votes, 2 answers

Hive - double precision

I have been working on Hive and found something peculiar. Basically, when using double as the datatype for a column, we need not specify any precision (Hive takes the precision dynamically depending on the value). But here is the issue.…
Sri Bharath
4 votes, 1 answer

How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one. Following the instructions here…
domkck
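
The switch itself is a single Hive property, set per session or in hive-site.xml; whether Hive on Spark actually works on a given Dataproc image is a separate matter:

    -- valid values are mr (the deprecated default), tez, and spark
    SET hive.execution.engine=spark;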
4 votes, 1 answer

Spark partitions - using DISTRIBUTE BY option

We have a Spark environment which should process 50MM rows. These rows contain a key column. The number of unique keys is close to 2000. I would like to process all of those 2000 keys in parallel. Thus we are using a Spark SQL like the…
Balaji Krishnan
4 votes, 1 answer

Ignore quotes in a CSV file while loading it into a Hive table

I have a CSV file with data in the format below: "SomeName1",25,"SomeString1" "SomeName2",26,"SomeString2" "SomeName3",27,"SomeString3" I am loading this CSV into a Hive table. In the table, columns 1 and 3 get inserted together with the quotes…
earl
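
One commonly used option is the OpenCSVSerde shipped with Hive, which strips the quotes on read; note that it exposes every column as STRING. A sketch with a hypothetical table and location:

    CREATE EXTERNAL TABLE people_csv (name STRING, age STRING, note STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
    LOCATION '/path/to/csv';   -- hypothetical HDFS directory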
4 votes, 1 answer

Creating Hive external table on AWS EMR

I am trying to create an external table in Hive via Hue on AWS EMR: CREATE EXTERNAL TABLE IF NOT EXISTS urls ( id STRING, `date` TIMESTAMP, url STRING, expandedUrl STRING, domain STRING ) PARTITIONED BY (`year` INT, `month` INT, `day`…
roblovelock
4 votes, 1 answer

Spark JDBC returning dataframe only with column names

I am trying to connect to a Hive table using Spark JDBC, with the following code: val df = spark.read.format("jdbc"). option("driver", "org.apache.hive.jdbc.HiveDriver"). option("user","hive"). option("password", ""). option("url",…
Rafzv
4 votes, 1 answer

Group by to combine multiple column values in Hive into one column

I am trying to combine multiple column values into one column based on a group-by key. Basically, I will use this to create nested JSON using the Spark 1.6 DataFrame API. Sample input table abc: a b c d e f …
Ashish Singh
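
In Hive itself, the usual building block for this is the collect_list aggregate (a sketch, assuming hypothetical columns grp_key and val):

    SELECT grp_key, collect_list(val) AS vals
    FROM abc
    GROUP BY grp_key;
    -- collect_list gathers each group's values into an array,
    -- which maps naturally onto a nested JSON array.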
4 votes, 2 answers

How to build a hive table on data which is separated by '^P' delimiter

My query is: CREATE EXTERNAL TABLE gateway_staging ( poll int, total int, transaction_id int, create_time timestamp, update_time timestamp ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '^P'; (I am not sure whether '^P' can be used as a…
Andy Reddy
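
^P is the DLE control character, ASCII 16, and Hive's delimited row format accepts it as the octal escape '\020' (the same convention as '\001' for ^A), so the DDL can be written as:

    CREATE EXTERNAL TABLE gateway_staging (
      poll INT, total INT, transaction_id INT,
      create_time TIMESTAMP, update_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\020';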