Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization and extract/transform/load (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to impose structure on this data
  • A query language called Hive Query Language (HiveQL), which is based on SQL, adds features such as DISTRIBUTE BY and TRANSFORM, and enables users familiar with SQL to query this data

At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that the built-in capabilities of the language may not support.
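As a rough illustration of the custom-mapper hook described above: the TRANSFORM clause streams rows to an external script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The sketch below is a hypothetical script; the function names `transform_line` and `run`, the two-column layout, and the choice to upper-case the second column are assumptions made for illustration only. It also assumes the default serialization, in which NULL arrives as the literal `\N`.

```python
import io
import sys


def transform_line(line):
    """Upper-case the second column of one tab-separated row."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: NULL is serialized as the literal \N, so pass it through untouched.
    if len(fields) > 1 and fields[1] != r"\N":
        fields[1] = fields[1].upper()
    return "\t".join(fields)


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


# Local dry run with in-memory streams instead of a real Hive job.
sample_in = io.StringIO("1\ta\n2\t\\N\n")
sample_out = io.StringIO()
run(sample_in, sample_out)
print(repr(sample_out.getvalue()))
```

Saved as a script (say `upper_value.py`, a made-up name) that calls `run()` with its default arguments, it could be invoked from a query along the lines of `ADD FILE upper_value.py; SELECT TRANSFORM(id, value) USING 'python upper_value.py' AS (id, value) FROM some_table;`, where `some_table` stands in for any two-column table.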

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis runs on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. In contrast, Hive query response times can be on the order of several minutes even for the smallest jobs, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority among Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods; Python and shell transformations; analytic and windowing functions; connections to different RDBMSs through JDBC drivers; and a Kafka connector
  • Ability to read and write almost any file format using native and third-party SerDes, including RegexSerDe
  • Numerous third-party extensions, for example the Brickhouse UDFs
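To make the SerDe item above a little more concrete, here is a minimal Python sketch of the idea behind RegexSerDe: each capture group of a regular expression becomes one column, and a line that does not match the whole pattern surfaces as a row of NULLs. This is an emulation for illustration, not Hive code. The function name `regex_serde_row` is hypothetical, and Python's `re` syntax differs from Java's regex in some details, so patterns may not transfer verbatim.

```python
import re


def regex_serde_row(line, input_regex, num_columns):
    """Map one text line to column values, one per capture group."""
    pattern = re.compile(input_regex)
    # Assumption: the number of capture groups must equal the number of columns.
    if pattern.groups != num_columns:
        raise ValueError("number of capture groups must equal number of columns")
    match = pattern.fullmatch(line)
    if match is None:
        # A non-matching line yields NULL (None here) in every column.
        return [None] * num_columns
    return list(match.groups())


print(regex_serde_row("1 a", r"(\d+) (\w+)", 2))
print(regex_serde_row("1,a", r"(\d+) (\w+)", 2))
```

A mismatch between the pattern (or delimiter) declared in the table DDL and the actual file layout is one common reason a table loads without errors yet shows only NULL values.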

How to write a good Hive question:

  1. Add a clear textual description of the problem.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide an example of the input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.

21846 questions
18 votes, 6 answers

getting null values while loading the data from flat files into hive tables

I am getting null values while loading data from flat files into Hive tables. My table structure is like this: hive> create table test_hive (id int,value string); and my flat file is like this: input.txt 1 a 2 b 3 c 4 d 5 e 6 …
user1823697
  • 189
  • 1
  • 1
  • 3
18 votes, 2 answers

Hive query results in vertical format like MySQL's "\G"?

Is there a way to get Hive to output the results in a columnar-fashion, like the "\G" option available from MySQL? http://dev.mysql.com/doc/refman//5.5/en/mysql-commands.html
Idr
  • 6,000
  • 6
  • 34
  • 49
18 votes, 5 answers

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both have the same functional meaning because they are used for the same work. The only thing that differs is the implementation. So when should each technology be used? Is there…
Ananda
  • 1,572
  • 7
  • 27
  • 54
17 votes, 3 answers

In a hadoop cluster, should hive be installed on all nodes?

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop. My question is: Hadoop is installed on all the cluster nodes.…
Vijay
  • 263
  • 1
  • 4
  • 12
17 votes, 9 answers

Check if table exists in hive metastore using Pyspark

I am trying to check whether a table exists in the Hive metastore and, if not, create the table. If the table exists, I want to append data. I have a snippet of the code below: spark.catalog.setCurrentDatabase("db_name") db_catalog = spark.catalog.listTables(dbName =…
Cryssie
  • 3,047
  • 10
  • 54
  • 81
17 votes, 4 answers

Setup Standalone Hive Metastore Service For Presto and AWS S3

I'm working in an environment where an S3 service is used as a data lake, but without AWS Athena. I'm trying to set up Presto to query the data in S3, and I know I need to define the data structure as Hive tables through the Hive…
mhaken
  • 1,075
  • 4
  • 14
  • 28
17 votes, 5 answers

How to create hive table from Spark data frame, using its schema?

I want to create a hive table using my Spark dataframe's schema. How can I do that? For fixed columns, I can use: val CreateTable_query = "Create Table my table(a string, b string, c double)" sparksession.sql(CreateTable_query) But I have many…
lserlohn
  • 5,878
  • 10
  • 34
  • 52
17 votes, 2 answers

What does "WITH SERDEPROPERTIES ( 'paths' = 'key1, key2, key3') " really do in Hive DDL json serde?

Much appreciated if anyone can provide a reference to this clause. I have been searching online with little luck.
Da Qi
  • 615
  • 5
  • 10
17 votes, 3 answers

Delete a database with tables in Hive

I have a database in hive which has around 100 tables. I would like to delete the whole database in a single shot query. How can we achieve that in Hive?
user7351648
17 votes, 6 answers

How to quit beeline?

I am using CDH 5.5 and need to use beeline. I am pretty new to it and learning it now. I can start beeline but cannot quit as we do in Hive. I need to use Ctrl+z to quit which is not the proper way. Can someone help?
user4503253
17 votes, 7 answers

Unable to exit Hive

I've just installed Hive on my Ubuntu machine (14.04). When I run hive in the terminal, it comes up with Logging initialized using configuration in jar:file:/home/nkhl/Documents/apachehive/lib/hive-common-1.2.1.jar!/hive-log4j.properties which is…
Anonymous Person
  • 1,437
  • 8
  • 26
  • 47
17 votes, 3 answers

REGEXP_REPLACE capturing groups

I was wondering if someone could help me understand how to use Hive's regexp_replace function to capture groups in the regex and use those groups in the replacement string. I have an example problem I'm working through below that involves…
jatal
  • 790
  • 1
  • 10
  • 19
17 votes, 3 answers

Hive creating a table but getting FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns

Here is the code I am using to create the table: CREATE TABLE vi_vb(cTime STRING, VI STRING, Vital STRING, VB STRING) PARTITIONED BY(cTime STRING, VI STRING) CLUSTERED BY(VI) SORTED BY(cTime) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS…
user3121369
  • 417
  • 2
  • 6
  • 13
17 votes, 1 answer

Does JDBC have a maximum ResultSet size?

Is there a maximum number of rows that a JDBC will put into a ResultSet specifically from a Hive query? I am not talking about fetch size or paging, but the total number of rows returned in a ResultSet. Correct me if I'm wrong, but the fetch size…
sparks
  • 736
  • 1
  • 9
  • 29
17 votes, 2 answers

Hive dynamic partitioning

I'm trying to create a partitioned table using dynamic partitioning, but I'm facing an issue. I'm running Hive 0.12 on Hortonworks Sandbox 2.0. set hive.exec.dynamic.partition=true; INSERT OVERWRITE TABLE demo_tab PARTITION (land) SELECT stadt,…
Baeumla
  • 443
  • 3
  • 6
  • 18