Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language (HiveQL), which is based on SQL with some additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also lets traditional map/reduce programmers plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
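
As a rough illustration of the custom-mapper hook described above, here is a minimal sketch of a Python script that Hive could stream rows through with TRANSFORM. It assumes Hive's default streaming format (one row per line, columns separated by tabs, NULL written as \N). The script name normalize.py, the columns genre and title, and the table table_name are hypothetical.

```python
import sys


def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row.

    Each column is trimmed and lower-cased. The \\N marker (assumed to be
    the NULL representation in the streamed text) is passed through as-is.
    """
    fields = line.rstrip("\n").split("\t")
    cleaned = [f if f == "\\N" else f.strip().lower() for f in fields]
    return "\t".join(cleaned)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive writes rows to the script's stdin and reads results from stdout.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Under these assumptions the script would be registered with ADD FILE normalize.py and invoked as SELECT TRANSFORM(genre, title) USING 'python normalize.py' AS (genre, title) FROM table_name. Keeping the per-row logic in a separate function makes it easy to check outside the cluster.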

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes, Hive query response times for even the smallest jobs can be on the order of several minutes. Larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods; Python and shell transformations; analytic and windowing functions; connections to different RDBMSs through JDBC drivers; and a Kafka connector
  • Ability to read and write almost any file format using native and third-party SerDes (for example, RegexSerDe)
  • Numerous third-party extensions, for example Brickhouse UDFs

How to write a good Hive question:

  1. Add a clear textual description of the problem.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide an example of the input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.


21846 questions
4 votes, 1 answer

Error applying authorization policy on hive configuration: Couldn't create directory ${system:java.io.tmpdir}\${hive.session.id}_resources

I run Hadoop 3.0.0-alpha1 on Windows and added Hive 2.1.1 to it. When I try to open the hive beeline with the hive command I get an error: Error applying authorization policy on hive configuration: Couldn't create directory…
Benvorth
4 votes, 1 answer

No KeyProvider is configured, cannot access an encrypted file

I have data in an encrypted zone in HDFS. I can read data with hive user, but when I create a hive table and try to query it via beeline I get this exception: Error: java.io.IOException: java.io.IOException: No KeyProvider is configured, cannot…
facha
4 votes, 2 answers

How to reset textinputformat.record.delimiter to its default value within hive cli / beeline?

Setting textinputformat.record.delimiter to a non-default value is useful for loading multi-row text, as shown in the demo below. However, I'm failing to set this parameter back to its default value without exiting the CLI and reopening it. None of…
David דודו Markovitz
4 votes, 1 answer

Hive Shell hangs and becomes unresponsive

My Hive shell hangs at logging initialization at configuration [cloudera@quickstart hive]$ hive 2017-03-01 08:23:50,909 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. …
Nandu
4 votes, 2 answers

Subtract days from a date in Apache Hive

How can I subtract a number of days from a date and get another date as the result? For example: 01/12/2016 - 10 = 21/11/2016
Diego Arias
4 votes, 0 answers

What are the required steps to use Beeline to query a remote Hadoop instance?

I have a Hadoop cluster running on another server. I am able to ssh into that server and use Hive to run queries. I'm trying to determine if I can query that server remotely, using Hive or Beeline; would prefer Beeline, since it's not being…
jcollum
4 votes, 1 answer

OrcRelation is not assignable to HadoopFsRelation

I am trying to run Spark SQL on Hive tables, but I get a problem that I cannot understand. Here is my code: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.Row; import…
Jaffer Wilson
4 votes, 1 answer

How to allow hive.mapred.mode=nonstrict?

I'm trying to run a query with a JOIN that has no ON clause. I'm running the query like: hive -v -f my_file.hql I got this message: In strict mode, cartesian product is not allowed. If you really want to perform the operation, set…
Alvaro Silvino
4 votes, 1 answer

(Hive, SQL) - How to sort a list of strings inside a column?

I have a big data problem in Hive (SQL). SELECT genre, COUNT(*) AS unique_count FROM table_name GROUP BY genre which gives result like: genre | unique_count ---------------------------------- Romance,Crime,Drama,Law |…
Afloz
4 votes, 1 answer

Get all Hive table/database creation/deletion details (audit logs)

Let's say I have a database, project. I created a table named tab1 and then later tab2. Now I dropped the table tab1. Where do I look for the logs that say I have dropped the table tab1 from database project? I would like to get the time, user…
K S Nidhin
4 votes, 1 answer

Hive Metastore column width limit

Using AWS EMR on the 5.2.1 version as data processing environment, when dealing with a huge JSON file that has a complex schema with many nested fields, Hive can't process it and errors as it reaches the current limit of 4000 characters column…
blamblam
4 votes, 1 answer

HIVE - ORC read Issue with NULL Decimal Values - java.io.EOFException: Reading BigInteger past EOF

I encountered an issue around HIVE when loading an ORC external table with NULLs inside a column that was defined as DECIMAL(31,8). It looks like hive is unable to read the ORC file after loading and can no longer view the records with a NULL inside…
Sidney
4 votes, 1 answer

Setting Spark as default execution engine for Hive

Hadoop 2.7.3, Spark 2.1.0 and Hive 2.1.1. I am trying to set spark as default execution engine for hive. I uploaded all jars in $SPARK_HOME/jars to hdfs folder and copied scala-library, spark-core, and spark-network-common jars to HIVE_HOME/lib.…
Mahmud
4 votes, 1 answer

Loading data in hive table with multiple charsets

I am facing an issue where I have multiple files with different charsets, say one file has Chinese charsets and another has French charsets. How can I load them into a single Hive table? I searched online and found this: ALTER TABLE mytable SET…
Paritosh Ahuja
4 votes, 1 answer

How to implement an Avro alias

I am new to Avro, and while trying to use the Avro alias property I am getting the error below. Query: select department_id , office_name from test.depart_alias; SemanticException [Error 10004]: Line 1:23 Invalid table alias or column reference…
Anaadih.pradeep