Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • A query language called Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data (see the example below).
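
For example, a minimal HiveQL query reads much like standard SQL, and DISTRIBUTE BY controls how rows are routed to reducers. This is only a sketch; the table and column names below are made up for illustration:

    -- hypothetical table: page_views(user_id STRING, country STRING, dt STRING, event_time TIMESTAMP)
    SELECT country, count(*) AS cnt
    FROM page_views
    WHERE dt = '2016-08-01'
    GROUP BY country;

    -- DISTRIBUTE BY sends all rows with the same user_id to the same reducer;
    -- SORT BY orders rows within each reducer
    SELECT user_id, event_time
    FROM page_views
    DISTRIBUTE BY user_id
    SORT BY user_id, event_time;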

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
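
A minimal sketch of plugging a custom script into a query with TRANSFORM, assuming a hypothetical streaming script my_parser.py that reads tab-separated rows from stdin and writes tab-separated rows to stdout:

    -- make the script available to the job
    ADD FILE /tmp/my_parser.py;

    -- stream (user_id, raw_line) through the script and read back two columns
    SELECT TRANSFORM (user_id, raw_line)
           USING 'python my_parser.py'
           AS (user_id STRING, parsed_field STRING)
    FROM raw_logs;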

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm is one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data and proceeds iteratively with response times between iterations of less than a few minutes, Hive response times can be on the order of several minutes even for the smallest jobs, while larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.
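
Two of these speed-ups can be shown briefly. This is only a sketch: the object names are hypothetical, and materialized views typically require Hive 3.x with transactional source tables:

    -- fetch-only task: let simple SELECTs bypass a full MapReduce/Tez job
    SET hive.fetch.task.conversion=more;

    -- materialized view that pre-aggregates a (hypothetical) events table
    CREATE MATERIALIZED VIEW mv_daily_counts AS
    SELECT dt, count(*) AS cnt
    FROM events
    GROUP BY dt;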

To summarize, while low-latency performance is not the top priority of Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions, the ability to call Java methods, Python and shell transformations, analytic and windowing functions, the ability to connect to different RDBMSs using JDBC drivers, and a Kafka connector (see the examples after this list).
  • The ability to read and write almost any file format using native and third-party SerDes (for example, RegexSerDe).
  • Numerous third-party extensions, for example the Brickhouse UDF collection.
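
As a rough illustration of the JSON, regexp, and SerDe support mentioned above (the column names, sample regex, and paths are hypothetical):

    -- built-in JSON and regexp functions
    SELECT get_json_object(json_col, '$.user.id')    AS user_id,
           regexp_extract(url_col, 'id=([0-9]+)', 1) AS url_id
    FROM raw_events;

    -- RegexSerDe lets Hive read arbitrary text layouts;
    -- this regex simply splits each line into host and the rest
    CREATE EXTERNAL TABLE apache_log (host STRING, request STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
    STORED AS TEXTFILE
    LOCATION '/data/logs/apache';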

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide input data and desired output examples (see the example after this list).
  5. Questions about query performance should include the EXPLAIN output of the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
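
For instance, a self-contained question body might look like the following made-up example, with DDL, sample data, the query in question, and the desired output all given as text:

    -- table DDL
    CREATE TABLE sales (id INT, amount DECIMAL(10,2)) STORED AS ORC;

    -- sample input data
    -- id | amount
    -- 1  | 10.00
    -- 2  | 15.50

    -- query in question
    SELECT id,
           sum(amount) OVER (ORDER BY id) AS running_total
    FROM sales;

    -- desired output
    -- id | running_total
    -- 1  | 10.00
    -- 2  | 25.50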


21846 questions

4 votes, 3 answers
How do I find what user owns a HIVE database?
I want to confirm which user is the owner of a database in HIVE. Where would I find this information?
RagePwn

4 votes, 2 answers
handle subfolders after partitions in hive
I got my directory structure as follows. /data/year=/month=/day=/source1/abc.log /data/year=/month=/day=/source2/def.log /data/year=/month=/day=/source3/xyz.log I wanted to create a hive table with year, month, date as partitions but it is…
kamoor

4 votes, 3 answers
Read multiple files in Hive table by date range
Let's imagine I store one file per day in a format: /path/to/files/2016/07/31.csv /path/to/files/2016/08/01.csv /path/to/files/2016/08/02.csv How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to…
Dmitry Petrov

4 votes, 3 answers
zeppelin hive interpreter throws ClassNotFoundException
I have deployed zeppelin 0.6 and configured hive under Jdbc interpreter. Tried executing %hive show databases Throws: org.apache.hive.jdbc.HiveDriver class java.lang.ClassNotFoundException …
Immanuel Fredrick

4 votes, 0 answers
Reading BLOB data which is stored as Binary datatype in Hive
We have Oracle BLOB and VARBINARY (SQL Server/Progress) data in hive which is stored as String or Binary datatype. We have brought data from respective RDBMS using sqoop. Now that we have data in hdfs, we like to see the actual attachments like pdf…
Despicable me

4 votes, 1 answer
Converting columns to rows (UNPIVOT) in hiveql
I have a table with a structure like this: column1, column2, column3, X1, X2, X3, X4 A1, A2, A3, 5, 6, 1, 4 I would like to convert this into column1, column2, column3, Key, Value A1, A2, A3, X1, 5 A1, A2,…
NG Algo

4 votes, 1 answer
airflow get result after executing an operator
I have configured airflow and created some Dags and subDags that call several operators. My trouble is that when an operators runs and finishes the job, I'd like to receive the results back in some python structure. For instance: File1.py ... …
Alg_D

4 votes, 1 answer
Using Hive with Pig
My hive query has multiple outer joins and takes very long to execute. I was wondering if it would make sense to break it into multiple smaller queries and use pig to work the transformations. Is there a way I could query hive tables or read hive…
uHadoop

4 votes, 3 answers
How to create a table with dates in sequence between range in Hive?
I'm trying to Create a table with column date, And I want to insert date in sequence between Range. Here's what I have tried: SET StartDate = '2009-01-01'; SET EndDate = '2016-06-31'; CREATE TABLE DateRangeTable(mydate DATE, qty INT); INSERT INTO…
Vitthal

4 votes, 1 answer
Execute multiple hive queries using single execute() method of statement class using java
I am using Java API to access HiveServer2, I have requirement of executing multiple hive queries in single call to execute() method of statements class. Is it possible to submit multiple queries of hive in one call to execute() method. I have hive…
Vaijnath Polsane

4 votes, 1 answer
Error when starting a hive thrift server on EMR
In the following code I'm trying to start a hive thrift server from spark: val conf = new SparkConf().setAppName("HiveDemo") val sc = new SparkContext(conf) val sql = new HiveContext(sc) sql.setConf("hive.server2.thrift.port", "10001") val df =…
djWann

4 votes, 1 answer
Insert data into a Hive table with HiveContext using Spark Scala
I was able to insert data into a Hive table from my spark code using HiveContext like below val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS e360_models.employee(id INT, name STRING, age…
yAsH

4 votes, 1 answer
Hive - Split delimited columns over multiple rows, select based on position
I'm Looking for a way to split the column based on comma delimited data. Below is my dataset id col1 col2 1 5,6 7,8 I want to get the result id col1 col2 1 5 7 1 6 8 The position of the index should match because I need to fetch…
divyabharathi

4 votes, 1 answer
Hive Insert overwrite into Dynamic partition external table from a raw external table failed with null pointer exception.
I have a raw external table with four columns- Table 1 : create external table external_partitioned_rawtable (age_bucket String,country_destination String,gender string,population_in_thousandsyear int) row format delimited fields terminated…
Barath

4 votes, 5 answers
Slowly changing dimensions- SCD1 and SCD2 implementation in Hive
I am looking for SCD1 and SCD2 implementation in Hive (1.2.1). I am aware of the workaround to load SCD1 and SCD2 tables prior to Hive (0.14). Here is the link for loading SCD1 and SCD2 with the workaround approach…
Lijju Mathew