Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • A query language called Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
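
For illustration, a minimal sketch of such a pluggable transform, assuming a hypothetical table web_logs and a user-supplied script my_parser.py:

  ADD FILE /tmp/my_parser.py;            -- ship the script to every worker

  SELECT TRANSFORM (ip, request)         -- columns written to the script's stdin
         USING 'python my_parser.py'     -- script reads/writes tab-separated lines
         AS (ip STRING, page STRING)     -- columns parsed from the script's stdout
  FROM web_logs;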

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes, Hive query response times can be on the order of several minutes even for the smallest jobs, while larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority of Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose coupling with its input formats
  • A rather rich query language, with native support for JSON, XML, and regular expressions; the ability to call Java methods; Python and shell transformations; analytic and windowing functions; and connectivity to other systems such as RDBMSs (via JDBC drivers) and Kafka (see the sketch after this list).
  • Ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe.
  • Numerous third-party extensions, for example the Brickhouse UDFs.
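
A hedged sketch of the query-language and SerDe features above; the tables (clicks, access_log), columns, and regular expressions are illustrative assumptions:

  -- JSON access, a regular expression, and a windowing function in one query
  SELECT get_json_object(payload, '$.user.id')        AS user_id,
         regexp_extract(url, 'product=([0-9]+)', 1)   AS product_id,
         row_number() OVER (PARTITION BY session_id ORDER BY event_time) AS event_rank
  FROM clicks;

  -- projecting structure onto raw text files with the built-in RegexSerDe
  CREATE EXTERNAL TABLE access_log (host STRING, request STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
  STORED AS TEXTFILE
  LOCATION '/data/access_log';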

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output.
  5. For questions about query performance, include the EXPLAIN output (see the example after this list).
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
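
As an example of items 2-5, a well-formed performance question might contain something like the following (names and data are placeholders):

  -- DDL
  CREATE TABLE sales (id BIGINT, amount DOUBLE) STORED AS ORC;
  -- input: (1, 10.0), (2, 20.0); desired output: 30.0
  SELECT sum(amount) FROM sales;
  -- attach the plan as formatted text, never as a picture
  EXPLAIN SELECT sum(amount) FROM sales;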


21846 questions
4 votes, 1 answer

Is there a compatibility mapping of Spark, Hadoop and Hive?

It's confusing to understand version compatibility between different versions of Spark and Hadoop, and similarly for Hadoop and Hive. Is there any table that can be followed to know which version of one is compatible with which version of the other?
Puneet Chaurasia
  • 441
  • 6
  • 14
4 votes, 1 answer

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps. I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDFs in…
Alex Woolford
  • 4,433
  • 11
  • 47
  • 80
4 votes, 2 answers

Create Hive index on complex column

Is it possible to create an index on a complex column in Hive? Complex as in map, struct, array, etc. columns. Example: CREATE TABLE employees ( name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING, FLOAT>…
lief480
  • 140
  • 1
  • 11
4 votes, 1 answer

Sqoop import : composite primary key and textual primary key

Stack: Installed HDP-2.3.2.0-2950 using Ambari 2.1. The source DB schema is on SQL Server and it contains several tables whose primary key is either: a varchar; composite - two varchar columns, or one varchar + one int column, or two int…
4 votes, 2 answers

HIVE escaped by not working '\\'

I have a data-set in S3: 123, "some random, text", "", "", 236. I built an external table on this dataset: CREATE EXTERNAL TABLE db1.myData( field1 bigint, field2 string, field3 string, field4 string, field5 bigint, ROW FORMAT…
underwood
  • 845
  • 2
  • 11
  • 22
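
Quoted CSV with embedded commas, as in the sample above, is usually handled with the built-in OpenCSVSerde rather than ESCAPED BY. A sketch; the column names are assumptions, and note that this SerDe yields strings, so numeric fields need casting at query time:

  CREATE EXTERNAL TABLE db1.myData (
    field1 STRING, field2 STRING, field3 STRING, field4 STRING, field5 STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = "\"")
  LOCATION 's3://bucket/path/';          -- placeholder location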
4 votes, 2 answers

Determine Hive version via query

With Apache Drill I can get the version through a JDBC connection by dispatching the query: SELECT version FROM sys.version. Is there an analogous way to determine the Hive version? I know I can use hive --version from a machine where Hive is…
Matt Pollock
  • 1,063
  • 10
  • 26
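
Reasonably recent Hive releases (2.1 and later, if memory serves) ship a version() UDF, which gives an analogous query over JDBC:

  SELECT version();   -- returns the build version string of the HiveServer2 instance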
4 votes, 0 answers

ERROR optimizer.ConstantPropagateProcFactory when querying an UDF

I get the following error output in hive when querying my Generic UDF: ERROR optimizer.ConstantPropagateProcFactory: Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@554286a4. Return value unrecoginizable. I get the error in…
Smicker
  • 41
  • 1
4 votes, 4 answers

Create hive table from file stored in hdfs in orc format

I want to know if it's possible to create a Hive table from a file stored in the Hadoop file system (users.tbl) in ORC format. I read that the ORC format is better than text in terms of optimization. So I would like to know if it's possible to create a hive table…
jUsr
  • 301
  • 1
  • 4
  • 9
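
A common pattern for the question above is to stage the text file in an external table and rewrite it into ORC with CREATE TABLE ... AS SELECT; all names below are illustrative:

  -- external table over the existing delimited file(s)
  CREATE EXTERNAL TABLE users_txt (id BIGINT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE
  LOCATION '/data/users';

  -- copy into an ORC-backed table in one statement
  CREATE TABLE users STORED AS ORC AS SELECT * FROM users_txt;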
4 votes, 2 answers

Is there any Spark hook like the Hive hooks?

I am working on a project and have to track the lineage of file transformations. Assume one file called SomeTextFile.txt goes through multiple Hive actions and in the final stage produces some magnificent result as needed. Case 1: the file went like (if I…
Sachin
  • 359
  • 2
  • 18
4 votes, 0 answers

Managing input split sizes in Hive running the tez engine

I want to gain a better understanding of how input splits are calculated in the Tez engine. I am aware that the hive.input.format property can be set to either HiveInputFormat (default) or CombineHiveInputFormat (generally accepted for…
Nitin Kumar
  • 765
  • 1
  • 11
  • 26
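
On the Tez engine, grouped split sizes are mostly governed by the Tez grouping bounds rather than the MapReduce split settings; a sketch of the usual knobs (the values are arbitrary examples):

  SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  SET tez.grouping.min-size=16777216;     -- ~16 MB lower bound per grouped split
  SET tez.grouping.max-size=1073741824;   -- ~1 GB upper bound per grouped split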
4 votes, 1 answer

How to union tables with different schema in HIVE?

I have two tables in HIVE: table A, which contains a column "N" of array type, and table B, in which column "N" does not appear. Both tables A and B contain column "C". I'd like to union them like this: select g.* from (select N, C from…
makansij
  • 9,303
  • 37
  • 105
  • 183
4 votes, 1 answer

Hive - Count number of occurrences of character

I am trying to count the number of occurrences of the pipe symbol in Hive - (6): select length(regexp_replace('220138|251965797?AIRFR?150350161961|||||','^(?:[^|]*\\|)(\\|)','')) from smartmatching limit 10. This is what I am trying and I am not getting it…
Adithya Kumar
  • 159
  • 1
  • 2
  • 12
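
A common idiom for counting occurrences of a character, as attempted above, is to compare the string length before and after stripping that character:

  -- returns 6 for the sample string: total length minus length without pipes
  SELECT length('220138|251965797?AIRFR?150350161961|||||')
       - length(regexp_replace('220138|251965797?AIRFR?150350161961|||||', '\\|', ''));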
4 votes, 2 answers

Can you change the format of dynamic partitions of Hive tables?

PRELUDE: I'm using an external Hive table with dynamic partitioning. SET hive.exec.dynamic.partition = true SET hive.exec.dynamic.partition.mode = nonstrict The table looks something like this: CREATE EXTERNAL TABLE `some_test`( `id` bigint, …
Sh4pe
  • 1,800
  • 1
  • 14
  • 30
4 votes, 3 answers

failed to get database default returning NoSuchObjectException

When I start Spark I get these warnings: Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_77) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. 16/04/03 15:07:31 WARN…
codin
  • 743
  • 5
  • 15
  • 27