Questions tagged [hive]

Apache Hive is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a data warehouse system built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • A query language, Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.
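For example, a HiveQL query using DISTRIBUTE BY might look like this (table and column names are purely illustrative):

```sql
-- Distribute rows across reducers by user_id, then sort within each reducer.
-- Table and column names here are illustrative, not from a real schema.
SELECT user_id, event_time, action
FROM clickstream
DISTRIBUTE BY user_id
SORT BY user_id, event_time;
```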

At the same time, this language also allows traditional map/reduce programmers the ability to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
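A minimal sketch of plugging in a custom script via TRANSFORM (the script name and output schema are hypothetical):

```sql
-- Stream rows through an external Python script acting as a custom mapper.
-- 'my_mapper.py' is a hypothetical script added to the session beforehand.
ADD FILE my_mapper.py;

SELECT TRANSFORM (user_id, action)
       USING 'python my_mapper.py'
       AS (user_id STRING, score INT)
FROM clickstream;
```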

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when they are completed, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes, response times for Hive queries can be on the order of several minutes even for the smallest jobs. Larger jobs (e.g., jobs processing terabytes of data) may in general run for hours or days. Many optimizations and improvements have been made to speed up processing, such as the fetch-only task, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • Rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods; Python and shell transformations; analytic and windowing functions; the ability to connect to different RDBMSs using JDBC drivers; and a Kafka connector.
  • Ability to read and write almost any file format using native and third-party SerDes (e.g., RegexSerDe).
  • Numerous third-party extensions, for example the Brickhouse UDFs.
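As an illustration of the native JSON support mentioned above, the built-in `get_json_object` function extracts fields from a JSON string (column and table names are hypothetical):

```sql
-- Extract fields from a JSON string column using a JSONPath-like expression.
SELECT get_json_object(json_col, '$.user.name') AS user_name,
       get_json_object(json_col, '$.items[0]')  AS first_item
FROM raw_events;
```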

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
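A minimal, reproducible example following the guidelines above might include the DDL, sample data, and the desired output (all names and values below are illustrative):

```sql
-- Table DDL
CREATE TABLE sales (id INT, amount DECIMAL(10,2), dt STRING);

-- Sample input:   (1, 10.00, '2023-01-01'), (2, 20.00, '2023-01-01')
-- Desired output: dt = '2023-01-01', total = 30.00

-- The query in question
SELECT dt, SUM(amount) AS total
FROM sales
GROUP BY dt;
```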


21846 questions
20 votes · 2 answers

AWS Glue issue with double quote and commas

I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', …
ln9187
  • 730
  • 1
  • 7
  • 23
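For CSV files with quoted fields containing commas, a table definition along these lines is a common approach (the location path is illustrative; exact SerDe properties may vary):

```sql
-- OpenCSVSerde treats all columns as STRING and honors the quote character,
-- so embedded commas inside quoted fields stay within a single column.
CREATE EXTERNAL TABLE addresses (reference STRING, address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://my-bucket/addresses/'   -- illustrative path
TBLPROPERTIES ('skip.header.line.count' = '1');
```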
20 votes · 2 answers

'' in subquery source in Hive query

I am running a query on Hive similar to: SELECT * FROM (SELECT a FROM b WHERE sex = 'M' AND degree = 'Bs' AND age = 15 AND name LIKE 'L%' ); the error is: cannot recognize input near '' ''…
Cyrus
  • 912
  • 2
  • 11
  • 21
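Hive requires an alias on a derived table, so the usual fix for this class of error is simply adding one:

```sql
-- The derived table must be given an alias (here 't'); without it,
-- Hive fails with "cannot recognize input near '' ''".
SELECT *
FROM (
  SELECT a FROM b
  WHERE sex = 'M' AND degree = 'Bs' AND age = 15 AND name LIKE 'L%'
) t;
```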
20 votes · 2 answers

Add PARTITION after creating TABLE in hive

I have created a non-partitioned table and loaded data into it; now I want to add a PARTITION based on department to that table. Can I do this? If I do: ALTER TABLE Student ADD PARTITION (dept='CSE') location '/test'; It gives me…
rinku buragohain
  • 329
  • 2
  • 7
  • 15
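A non-partitioned table cannot have partitions added after the fact; one common workaround is recreating the table as partitioned and reloading the data via dynamic partitioning (table and column names are illustrative):

```sql
-- Recreate the table with a partition column and copy the data over.
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE student_part (name STRING, roll INT)
PARTITIONED BY (dept STRING);

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE student_part PARTITION (dept)
SELECT name, roll, dept FROM student;
```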
20 votes · 1 answer

Hive padding leading zeroes

I need the output of a string column in my table as a 13-character value, irrespective of its actual length; I need to fill the remaining characters with 0... I tried to use the following code in my hive query, but failed to get the desired…
Muthu Palaniappan
  • 221
  • 1
  • 2
  • 5
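Left-padding to a fixed width is typically done with the built-in `lpad` function:

```sql
-- Pad the string column to 13 characters with leading zeroes.
SELECT lpad(code_col, 13, '0') AS padded_code
FROM my_table;
-- e.g. lpad('12345', 13, '0') returns '0000000012345'
```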
20 votes · 7 answers

Hive date function to achieve day of week

I'm looking for a workaround or hive date functions that gives day of the week , Sunday - 1 Monday - 2 Tuesday - 3 Wednesday - 4 Thursday - 5 Friday - 6 Saturday - 7 Requirement in detail : I'm looking for a function that takes date string…
user3279189
  • 1,643
  • 8
  • 22
  • 35
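Two common approaches to this: arithmetic anchored on a known Sunday, or the `date_format` pattern `u` (which is Monday = 1) shifted accordingly:

```sql
-- Sunday = 1 ... Saturday = 7, anchored on 1900-01-07 (a Sunday).
SELECT pmod(datediff('2023-06-04', '1900-01-07'), 7) + 1;  -- 2023-06-04 is a Sunday -> 1

-- Alternatively (Hive 1.2+): date_format returns Monday = 1 ... Sunday = 7.
SELECT date_format('2023-06-04', 'u');  -- returns '7'
```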
20 votes · 6 answers

Pig vs Hive vs Native Map Reduce

I've basic understanding on what Pig, Hive abstractions are. But I don't have a clear idea on the scenarios that require Hive, Pig or native map reduce. I went through few articles which basically points out that Hive is for structured processing…
Maverick
  • 484
  • 2
  • 9
  • 20
19 votes · 2 answers

Hive throws: WstxParsingException: Illegal character entity: expansion character (code 0x8)

I installed hadoop and hive following this tutorial. The tutorial is outdated so I used the current versions. When I start hive I get the following error: hive SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]…
Soerendip
  • 7,684
  • 15
  • 61
  • 128
19 votes · 1 answer

Spark without Hadoop: Failed to Launch

I'm running Spark 2.1.0, Hive 2.1.1 and Hadoop 2.7.3 on Ubuntu 16.04. I download the Spark project from github and build the "without hadoop" version: ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz …
Top.Deck
  • 1,077
  • 3
  • 16
  • 31
19 votes · 3 answers

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset, that is partitioned by two columns - plant_name and tag_id. The second partition - tag_id has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark…
Euan
  • 559
  • 4
  • 10
19 votes · 4 answers

How to find the most recent partition in HIVE table

I have a partitioned table - with 201 partitions. I need to find latest partition in this table and use it to post process my data. The query to find list of all partitions is : use db; show partitions table_name; I need a query to find the…
underwood
  • 845
  • 2
  • 11
  • 22
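`SHOW PARTITIONS` lists partitions in lexicographic order, so one common approach is to take the last line of its output; alternatively, query the partition column directly (the column name `dt` is illustrative):

```sql
-- Metadata-only listing; the latest partition is the last line when
-- partition values sort lexicographically, e.g. dt='YYYY-MM-DD'.
SHOW PARTITIONS table_name;

-- Or query the partition column itself; for a partition column this can
-- often be answered from metadata without a full scan.
SELECT MAX(dt) FROM table_name;
```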
19 votes · 8 answers

Dropping multiple tables with same prefix in Hive

I have few tables in hive that has same prefix like below.. temp_table_name temp_table_add temp_table_area There are few hundreds of tables like this in my database along with many other tables. I want to delete tables that starts with…
Alex Raj Kaliamoorthy
  • 2,035
  • 3
  • 29
  • 46
19 votes · 1 answer

Attempt to do update or delete using transaction manager that does not support these operations

While trying to update a data in Hive table in Cloudera Quickstart VM, I'm getting this error. Error while compiling statement: FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support…
Vivek Harry
  • 419
  • 3
  • 11
  • 25
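Hive UPDATE/DELETE requires ACID support to be enabled and a transactional ORC table; the exact requirements vary by Hive version, but a typical configuration sketch looks like this:

```sql
-- Session-level settings enabling the ACID transaction manager
-- (usually set once in hive-site.xml instead).
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- The target table must be ORC and flagged transactional;
-- bucketing is additionally required on older Hive versions.
CREATE TABLE t (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```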
19 votes · 4 answers

Hive from_unixtime for milliseconds

We have a timestamp epoch column (BIGINT) stored in Hive. We want to get Date 'yyyy-MM-dd' for this epoch. Problem is my epoch is in milliseconds e.g. 1409535303522. So select timestamp, from_unixtime(timestamp,'yyyy-MM-dd') gives wrong results for…
Sourabh Potnis
  • 1,431
  • 1
  • 17
  • 26
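`from_unixtime` expects seconds, so a millisecond epoch must be divided by 1000 first:

```sql
-- 1409535303522 ms -> 1409535303 s; the CAST truncates the fraction.
SELECT from_unixtime(CAST(1409535303522 / 1000 AS BIGINT), 'yyyy-MM-dd');
-- e.g. returns '2014-09-01' when the session time zone is UTC
```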
19 votes · 1 answer

Hive Select Into

I have a database people in hive. Its schema is as follows: name string, dob_date int, dob_month int, dob_year int. I have successfully loaded data from a file into the database. Now I want to have people having dob_year=1990 in a new table. The…
Nemil A Timbadia
  • 319
  • 1
  • 3
  • 6
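Hive has no SELECT INTO; the equivalent is CREATE TABLE AS SELECT (CTAS), or an INSERT into a pre-created table:

```sql
-- Create the new table directly from the query result (CTAS)...
CREATE TABLE people_1990 AS
SELECT * FROM people WHERE dob_year = 1990;

-- ...or insert into an already-existing table with a matching schema.
INSERT OVERWRITE TABLE people_1990
SELECT * FROM people WHERE dob_year = 1990;
```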
19 votes · 1 answer

R Hive Thrift Client

I'm working on adding HiveServer2 support to my company's R data-access package. I'm curious what the best way of generating an R Thrift client would be. I'm considering writing an R wrapper around the Java Thrift client, similar to what rhbase…
yoni
  • 5,686
  • 3
  • 27
  • 28