Questions tagged [hive]

Apache Hive is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a data warehouse system built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language (HiveQL), which is based on SQL with some additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that the built-in capabilities of the language may not support.
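
As a minimal sketch of the custom-script route, the following Python filter could be plugged into a query through TRANSFORM. It assumes Hive's default streaming format (tab-separated fields, one row per line on stdin and stdout). The script name `normalize_user.py`, the table `page_views`, and its columns are hypothetical and chosen only for illustration.

```python
import sys


def transform_row(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    fields = line.rstrip("\n").split("\t")
    user, url = fields[0], fields[1]
    # Illustrative logic only: lowercase the user name and
    # append the URL length as an extra output column.
    return "\t".join([user.lower(), url, str(len(url))])


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive streams rows to the script on stdin and reads results from stdout.
    for line in stdin:
        if line.strip():
            stdout.write(transform_row(line) + "\n")


if __name__ == "__main__":
    main()

# Hypothetical invocation from HiveQL (names are assumptions):
#   ADD FILE normalize_user.py;
#   SELECT TRANSFORM (user_name, url)
#          USING 'python normalize_user.py'
#          AS (user_name, url, url_length)
#   FROM page_views;
```

Keeping the per-row logic in its own function makes the script easy to check locally with a few sample lines before it is shipped to the cluster.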

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis runs on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. In contrast, Hive query response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods and to use Python and shell transformations; analytic and windowing functions; connectivity to various RDBMSs through JDBC drivers; and a Kafka connector.
  • Ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe.
  • Numerous third-party extensions, for example Brickhouse UDFs.
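
To make the RegexSerDe idea more concrete, here is a rough Python analogue, not Hive's implementation: each capture group of a regular expression becomes one column, and a line that does not match yields NULL (`None` here) for every column. The log layout, the pattern, and the column names are made up for illustration.

```python
import re

# Hypothetical log layout: "<ip> [<timestamp>] <status>"
LINE_PATTERN = re.compile(r"(\S+) \[([^\]]+)\] (\d{3})")
COLUMNS = ("ip", "ts", "status")


def parse_line(line):
    """Map each capture group to a column; a non-matching line gives all-None."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        return dict.fromkeys(COLUMNS)
    return dict(zip(COLUMNS, match.groups()))


rows = [
    parse_line(text)
    for text in (
        "10.0.0.1 [2012-10-01 00:00:00] 200",
        "malformed line",
    )
]
```

In Hive itself, the pattern would be declared in the table DDL with `ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'` and an `input.regex` entry in `SERDEPROPERTIES`, rather than in application code.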

How to write a good Hive question:

  1. Add a clear textual description of the problem.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide an example of the input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.

21846 questions
17
votes
3 answers

Hadoop: Python client driver for HiveServer2 fails to install

I am trying to install a Python client driver for HiveServer2: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver Installations says that: "A Python client driver for HiveServer2 is…
dokondr
  • 3,389
  • 12
  • 38
  • 62
17
votes
1 answer

Why is count(distinct) slower than group by in Hive?

On Hive, I believe count(distinct) will be more likely than group-by to result in an unbalanced workload to reducers and end up with one sad reducer grinding away. Example query below. Why? Example query: select count(distinct user) from…
dfrankow
  • 20,191
  • 41
  • 152
  • 214
17
votes
5 answers

Local Time Convert To UTC Time In Hive

I searched a lot on the Internet but couldn't find the answer. Here is my question: I'm writing some queries in Hive. I have a UTC timestamp and would like to change it to UTC time, e.g., given timestamp 1349049600, I would like to convert it to UTC…
Iam619
  • 795
  • 2
  • 12
  • 28
17
votes
3 answers

SQL/Hive count distinct column

How do I do this in Hive? columnA columnB columnC 100.10 50.60 30 100.10 50.60 30 100.10 50.60 20 100.10 70.80 40 Output should be: columnA columnB …
user2441441
  • 1,237
  • 4
  • 24
  • 45
17
votes
7 answers

org.apache.hadoop.hbase.PleaseHoldException: Master is initializing

I am trying to set up a multinode HBase cluster. When I run jps on the slave I get 5780 Jps 5558 HQuorumPeer 5684 HRegionServer 1963 DataNode 2093 TaskTracker; similarly, on the master I get 4254 SecondaryNameNode 15226 Jps 14982 HMaster 3907…
Naresh
  • 5,073
  • 12
  • 67
  • 124
17
votes
1 answer

How does Hive choose the number of reducers for a job?

Several places say the default # of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks symbol to manually set the number of reducers. When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one.…
dfrankow
  • 20,191
  • 41
  • 152
  • 214
17
votes
7 answers

How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by comma. While creating external table in hive, I am able to specify delimiter as comma but how do I…
17
votes
2 answers

Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

I need to perform an initial upload of roughly 130 million items (5+ Gb total) into a single DynamoDB table. After I faced problems with uploading them using the API from my application, I decided to try EMR instead. Long story short, the import of…
Yuriy
  • 1,964
  • 16
  • 23
16
votes
5 answers

Hive: Table creation with multi-files with multiple directories

I want to create a Hive table where the input text files are spread across multiple subdirectories in HDFS. For example, I have in HDFS: /testdata/user/Jan/part-0001 /testdata/user/Feb/part-0001 /testdata/user/Mar/part-0001 and so…
user706794
  • 201
  • 1
  • 3
  • 6
16
votes
2 answers

pyspark read multiple csv files at once

I'm using SPARK to read files in hdfs. There is a scenario, where we are getting files as chunks from legacy system in csv…
Raja
  • 507
  • 1
  • 6
  • 24
16
votes
2 answers

Update , SET option in Hive

I know there is no update of file in Hadoop but in Hive it is possible with syntactic sugar to merge the new values with the old data in the table and then to rewrite the table with the merged output but if I have the new values in another table…
Jothi
  • 183
  • 1
  • 2
  • 6
16
votes
2 answers

Hive service, HiveServer2 & MetaStore service?

I am trying to understand Hive in terms of architecture, and I am referring to Tom White's book on Hadoop. I came across the following terms in regard to Hive: Hive Services, hiveserver2, and metastore, among others. Referring to the diagrams below from…
CuriousMind
  • 8,301
  • 22
  • 65
  • 134
16
votes
3 answers

How to connect to remote hive server from spark

I'm running Spark locally and want to access Hive tables, which are located in the remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spark-2.0.0]$./bin/beeline Beeline version 1.2.1.spark2…
April
  • 819
  • 2
  • 12
  • 23
16
votes
1 answer

Use collect_list and collect_set in Spark SQL

According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get it to work. I'm running Spark 1.6.0 using a Docker image. I'm trying to do this in Scala: import…
JFX
  • 432
  • 1
  • 4
  • 10
16
votes
3 answers

How to get all table definitions in a database in Hive?

I am looking to get all table definitions in Hive. I know that for single table definition I can use something like - describe <> describe extended <> But, I couldn't find a way to get all table definitions. Is there…
GoldenPlatinum
  • 427
  • 2
  • 4
  • 12