Questions tagged [hive]

Apache Hive is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a data warehouse system built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to project structure onto this data
  • A query language called Hive Query Language (HiveQL), based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language.
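
For example, a custom script can be wired into a query with TRANSFORM. A minimal sketch, assuming a hypothetical raw_logs table with a single line column and a my_parser.py script that reads tab-separated lines from stdin and writes (user_id, event) pairs to stdout:

    -- Make the hypothetical script available to the job, then stream rows through it.
    ADD FILE /tmp/my_parser.py;

    SELECT TRANSFORM (line)
           USING 'python my_parser.py'
           AS (user_id STRING, event STRING)
    FROM raw_logs;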

Since Hive is Hadoop-based, it does not and cannot promise low latency on queries. The paradigm here is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data and proceeds much more iteratively, with response times between iterations of less than a few minutes, Hive query response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may in general run for hours or days. Many optimizations and improvements have been made to speed up processing, such as the fetch-only task, LLAP, and materialized views.

To summarize, while low-latency performance is not a top priority in Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions, the ability to call Java methods, Python and shell transformations, analytic and windowing functions, and the ability to connect to different RDBMSs via JDBC drivers and the Kafka connector (see the sketch after this list).
  • The ability to read and write almost any file format using native and third-party SerDes, for example RegexSerDe.
  • Numerous third-party extensions, for example the Brickhouse UDFs.
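
As an illustration of the JSON and windowing support mentioned above, a minimal sketch (the events table and its json column are hypothetical):

    -- Extract JSON fields with the built-in get_json_object and rank events per user.
    SELECT get_json_object(json, '$.user_id') AS user_id,
           get_json_object(json, '$.ts')      AS ts,
           row_number() OVER (PARTITION BY get_json_object(json, '$.user_id')
                              ORDER BY get_json_object(json, '$.ts')) AS rn
    FROM events;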

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output of the query (see the sketch after this list).
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
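
For instance, a performance question might include something along these lines (the table and query below are placeholders, not a template you must follow):

    -- DDL for the table involved ...
    CREATE TABLE sales (id BIGINT, amount DOUBLE) PARTITIONED BY (dt STRING) STORED AS ORC;

    -- ... the slow query itself ...
    SELECT dt, sum(amount) FROM sales WHERE dt >= '2024-01-01' GROUP BY dt;

    -- ... and its plan, pasted as text into the question.
    EXPLAIN SELECT dt, sum(amount) FROM sales WHERE dt >= '2024-01-01' GROUP BY dt;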


21846 questions
19 votes, 6 answers
Hive External table-CSV File- Header row
Below is the hive table i have created: CREATE EXTERNAL TABLE Activity ( column1 type, column2 type ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/exttable/'; In my HDFS location /exttable, i have lot of CSV files and each CSV…
asked by user1261215

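A common way to keep header rows out of query results for CSV-backed external tables like this is the skip.header.line.count table property; a minimal sketch (the column names and types are placeholders, not the asker's real schema):

    CREATE EXTERNAL TABLE activity (
      column1 STRING,
      column2 STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/exttable/'
    TBLPROPERTIES ('skip.header.line.count'='1');  -- skip the first line of every file
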
18 votes, 7 answers
how to write subquery and use "In" Clause in Hive
How can I use In clause in Hive I want to write something like this in Hive select x from y where y.z in (select distinct z from y) order by x; But I am not finding any way of doing it.. I tried In clause in Hive 0.7 it was throwing error, Also I…
asked by user722856

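Hive releases as old as 0.7 did not support IN subqueries; the classic workaround is LEFT SEMI JOIN, while Hive 0.13 and later accept the subquery form directly. A sketch written against the asker's table y:

    -- Workaround for old Hive versions without IN-subquery support:
    SELECT a.x
    FROM y a
    LEFT SEMI JOIN (SELECT DISTINCT z FROM y) b ON (a.z = b.z)
    ORDER BY a.x;

    -- Hive 0.13+ also accepts the subquery directly:
    -- SELECT x FROM y WHERE z IN (SELECT DISTINCT z FROM y) ORDER BY x;
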
18 votes, 4 answers
Hive: dynamic partition adding to external table
I am running hive 071, processing existing data which is has the following directory layout: -TableName - d= (e.g. 2011-08-01) - d=2011-08-02 - d=2011-08-03 ... etc under each date I have the date files. now to load the data I'm using CREATE…
asked by Tomer

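For a layout of d=YYYY-MM-DD directories under an external table's root, partitions can be registered explicitly or discovered in bulk; a minimal sketch (the table name and paths are illustrative):

    -- Register one existing directory as a partition:
    ALTER TABLE TableName ADD IF NOT EXISTS PARTITION (d='2011-08-01')
    LOCATION '/path/to/TableName/d=2011-08-01';

    -- Or let Hive discover all d=... directories under the table location at once:
    MSCK REPAIR TABLE TableName;
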
18 votes, 1 answer
What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?
I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, you only need to do ls on the root folder of the table (given the table is partitioned by only one column), and get all its partitions,…
asked by gdoron

18 votes, 3 answers
Container killed by the ApplicationMaster Exit code is 143
I've been getting the following error in several cases: 2017-03-23 11:55:10,794 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1490079327128_0048_r_000003_0:…
asked by Yuval

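Exit code 143 means the container received a SIGTERM, which in practice is often YARN killing a task that exceeded its memory limit; a common first step is to raise container and heap sizes for the job. A sketch of session-level settings with purely illustrative values:

    -- Illustrative values; tune to the cluster's container sizing.
    SET mapreduce.map.memory.mb=4096;
    SET mapreduce.map.java.opts=-Xmx3276m;
    SET mapreduce.reduce.memory.mb=8192;
    SET mapreduce.reduce.java.opts=-Xmx6553m;
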
18 votes, 3 answers
While airflow initdb, ImportError: cannot import name HiveOperator
I have recently installed airflow for my workflows. While creating my project, I executed following command: airflow initdb which returned following error: [2016-08-15 11:17:00,314] {__init__.py:36} INFO - Using executor SequentialExecutor DB:…
asked by Rusty

18 votes, 3 answers
What's the best way to write if/else if/else if/else in HIVE?
Hive uses IF(condition, expression, expression), so when I want to do if / else if / else if / else, I have to do: IF(a, 1, IF(b, 2, IF(c, 3, 4))) Is there a better way to do this that's more readable? Looking for something similar to the…
asked by Popcorn

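Hive supports the standard SQL CASE expression, which reads better than deeply nested IFs; a sketch using the asker's placeholder conditions a, b and c (some_table is hypothetical):

    SELECT CASE
             WHEN a THEN 1
             WHEN b THEN 2
             WHEN c THEN 3
             ELSE 4
           END AS result
    FROM some_table;
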
18 votes, 4 answers
Main difference between dynamic and static partitioning in Hive
What is the main difference between static and dynamic partition in Hive? Using individual insert means static and single insert to partition table means dynamic. Is there any other advantage?
asked by Ronak

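In short: with static partitioning the partition value is spelled out in the statement, while with dynamic partitioning Hive derives it from the selected data. A minimal sketch with illustrative table and column names:

    -- Static: the target partition is fixed in the statement.
    INSERT OVERWRITE TABLE sales PARTITION (dt='2014-03-01')
    SELECT id, amount FROM staging WHERE dt='2014-03-01';

    -- Dynamic: Hive creates one partition per distinct dt value in the data.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE sales PARTITION (dt)
    SELECT id, amount, dt FROM staging;
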
18 votes, 1 answer
Python Connection to Hive
I installed the Hortonworks Hive ODBC driver and created a connection in the Data sources. I tested it and it worked successfully. I installed PyODBC and wrote the following code import os, sys, pyodbc; con = pyodbc.connect("DSN=MyCon") I got…
asked by Knows Not Much

18 votes, 2 answers
Why can't hive recognize alias named in select part?
Here's the scenario: When I invoke hql as follows, it tells me that it cannot find alias for u1. hive> select user as u1, url as u2 from rank_test where u1 != ""; FAILED: SemanticException [Error 10004]: Line 1:50 Invalid table alias or column…
asked by Judking

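As in standard SQL, the WHERE clause is evaluated before the SELECT list, so SELECT aliases are not visible there; either repeat the underlying expression or filter in an outer query. A sketch based on the asker's rank_test example:

    -- Repeat the column in WHERE ...
    SELECT user AS u1, url AS u2
    FROM rank_test
    WHERE user != '';

    -- ... or alias in an inner query and filter outside it.
    SELECT u1, u2
    FROM (SELECT user AS u1, url AS u2 FROM rank_test) t
    WHERE u1 != '';
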
18 votes, 3 answers
Parse json arrays using HIVE
I have many json arrays stored in a table (jt) that looks like this: [{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}] [{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}] Each array is a…
asked by marlieg

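For fixed positions within such a JSON array, the built-in get_json_object with an array subscript in the JSON path may be enough; exploding arrays of arbitrary length generally needs a JSON SerDe or a UDF (for example from Brickhouse, mentioned above). A sketch assuming the array is stored in a string column called json:

    SELECT get_json_object(json, '$[0].ts')  AS first_ts,
           get_json_object(json, '$[0].log') AS first_log,
           get_json_object(json, '$[1].log') AS second_log
    FROM jt;
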
18 votes, 5 answers
Loading data from one Hive table to another with partition
I have data in one Hive table and would like to load data into another hive table. The source table is reg_logs which has 2 partitions, date and hour. The data gets loaded into this table hourly. The schema is: CREATE EXTERNAL TABLE IF NOT EXISTS…
asked by Rio

18 votes, 5 answers
How to add partition using hive by a specific date?
I'm using hive (with external tables) to process data stored on amazon S3. My data is partitioned as follows: DIR s3://test.com/2014-03-01/ DIR s3://test.com/2014-03-02/ DIR …
asked by Brisi

18 votes, 6 answers
Handling NULL values in Hive
I am trying to create a table (table 2) in Hive from another table (table 1). I am trying to exclude certain rows with NULL values and tried the following condition. insert overwrite table table2 partition (date = '2013-06-01') select…
asked by Ravi

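NULLs must be filtered with IS NULL / IS NOT NULL (a comparison such as col != NULL never matches), and can be replaced with COALESCE or NVL instead of dropping the row. A sketch with placeholder column names based on the asker's statement:

    INSERT OVERWRITE TABLE table2 PARTITION (`date` = '2013-06-01')
    SELECT col1,
           COALESCE(col2, 'unknown') AS col2   -- replace NULLs rather than dropping the row
    FROM table1
    WHERE col1 IS NOT NULL;                    -- exclude rows where col1 is NULL
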
18 votes, 7 answers
How to identify which database the user is using in hive CLI ?
Is there any way to identify which database the user is using while using the command line interface ?
asked by Nithin

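Two common options are showing the current database in the CLI prompt or querying it directly (current_database() is available from Hive 0.13 onward):

    -- Show the current database as part of the hive CLI prompt:
    SET hive.cli.print.current.db=true;

    -- Or ask for it explicitly:
    SELECT current_database();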