Questions tagged [hive]

Apache Hive is a database built on top of Hadoop and facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible distributed file system. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. Please DO NOT use this tag for flutter database which is also named Hive, use flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets data stored in Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language which is based on SQL and some additional features such as DISTRIBUTE BY, TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers the ability to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real-time queries. In contrast to the systems such as Oracle where analysis is run on a significantly smaller amount of data, but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes, Hive queries response times for even the smallest jobs can be of the order of several minutes. However for larger jobs (e.g., jobs processing terabytes of data) in general they may run into hours and days. Many optimizations and improvements were made to spped-up processing such as fetch-only task, LLAP, materialized views, etc

To summarize, while low latency performance is not the top-priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • Rather reach query kanguage with native suport for JSON, XML, regexp, possibility to call java methods, using python and shell transformations, analytics and windowing functions, possibility to connect to different RDBMS using JDBC drivers, Kafka connector.
  • Ability to read and write almost any file formats using native and third-party SerDe, RegexSerDe.
  • Numerous third-party extensions, for example brickhouse UDFs, etc

How to write good Hive question:

  1. Add clear textual problem description.
  2. Provide query and/or table DDL if applicable
  3. Provide exception message
  4. Provide input and desired output data example
  5. Questions about query performance should include EXPLAIN query output.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output and exception messages.
  7. Use proper code and text formatting

Official links:

Useful Links:

21846 questions
4
votes
1 answer

Hive Sort Merge Bucket Join

Is Sort merge Bucket Join different from Sort Merge Bucket Map join? If so, what hints should be added to enable SMB join? How is SMBM join superior to SMB join? Will "set hive.auto.convert.sortmerge.join=true" this hint alone be sufficient for SMB…
Bagavathi
  • 438
  • 2
  • 7
  • 17
4
votes
1 answer

Load local csv file to hive parquet table directly,not resort to a temp textfile table

I am now preparing to store data in .csv files into hive. Of course, because of the good performance of parquet file format, the hive table should is parquet format. So, the normal way, is to create a temp table whose format is textfile, then I load…
wuchang
  • 3,003
  • 8
  • 42
  • 66
4
votes
2 answers

How to remove correlated subquery in my subselect (due to presto limitation)

I am trying to find the distinct count of users from the US that have performed a specific action (any row in p.action) within a 14 day rolling window, over the past ~2 months. Here's the query. I'd love some tips on how I can re-write this to not…
Frank
  • 45
  • 1
  • 7
4
votes
0 answers

Insert overwrite statement runs much slower in spark-sql than it does in hive-client

spark version: 2.0.0 hive version: 2.0.1 I find insert overwrite statement running in spark-sql or spark-shell spends much more time than it does in hive-client (i start it in apache-hive-2.0.1-bin/bin/hive ), where spark costs about ten minutes…
snodawn
  • 51
  • 1
  • 4
4
votes
2 answers

How to get lastaltertimestamp from Hive table?

Teradata has the concept of lastaltertimestamp, which is the last time an alter table command was executed on a table. lastaltertimestamp can be queried. Does Hive have a similar value that can be queried? The timestamp returned by hdfs dfs -ls…
user1332148
  • 1,256
  • 2
  • 11
  • 24
4
votes
1 answer

drop partition dynamically from HIVE table

I have a hive table where partition exists on one of the date column, but date column is stored as INT in the format YYYYMMDD. The table can also contain data for future date's partitions. Now as part of the process I want to drop partitions which…
Koushik Chandra
  • 1,565
  • 12
  • 37
  • 73
4
votes
2 answers

Create a date table in hive

How to create a table in hive which should have all the dates from 1st Jan 2016 till today (01-01-2016 to 12-10-2016)? The table would have only one column i.e. the date column. Thanks.
ravi.j
  • 41
  • 1
  • 2
4
votes
1 answer

Why hdfs throwing LeaseExpiredException in Hadoop cluster (AWS EMR)

I am getting LeaseExpiredException in hadoop cluster - tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-30-2-148.log 2016-09-21 11:54:14,533 INFO BlockStateChange (IPC Server handler 10 on 8020): BLOCK* InvalidateBlocks: add…
devsda
  • 4,112
  • 9
  • 50
  • 87
4
votes
1 answer

output hive query result as csv enclosed in quotes

I have to export data from a hive table in a csv file in which fields are enclosed in double quotes. So far I am able to generate a csv without quotes using the following query INSERT OVERWRITE DIRECTORY '/user/vikas/output' ROW FORMAT DELIMITED…
Vikas Saxena
  • 1,073
  • 1
  • 12
  • 21
4
votes
2 answers

How to load data to same Hive table if file has different number of columns

I have a main table (Employee) which is having 10 columns and I can load data into it using load data inpath /file1.txt into table Employee My question is how to handle the same table (Employee) if my file file2.txt has same columns but column 3 and…
Raj
  • 537
  • 4
  • 9
  • 18
4
votes
1 answer

Job cancelled because SparkContext was shut down while saving dataframe as hive table

I am trying to save a dataframe to hive table. Simplified code looks like this: DataFrame df = hiveContext.sql(); df.write().mode("overwrite").saveAsTable("schemaName.tableName"); Here is the error which I am getting while running this…
vatsal mevada
  • 5,148
  • 7
  • 39
  • 68
4
votes
1 answer

Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I want to use my local code connect to the remote hive by spark-sql . this is my code : package src.main.scala import org.apache.spark.{SparkConf, SparkContext} import com.datastax.spark.connector.cql.CassandraConnector import…
Kof
  • 65
  • 2
  • 5
4
votes
2 answers

Hive 'limit' in subquery executes after full query

I’m testing a rather taxing rlike function in a hive query. I figured I’d test against a subset first before applying it to my TB+ of data. The full query is: create table proxy_parsed_clean as select a.*, case when domainname rlike…
TayTay
  • 6,882
  • 4
  • 44
  • 65
4
votes
1 answer

Impala : argument of type 'NoneType' is not iterable

I have imported a table from MySQL which has 10 million rows to Hive and now performing some operations in Impala to check the functionality and performance. Now I am getting an error argument of type 'NoneType' is not iterable when I issue the…
Happy Coder
  • 4,255
  • 13
  • 75
  • 152
4
votes
1 answer

Extract all characters before a period with HiveQL regex?

I have a table that looks like: bl.ah foo.bar bar.fight And I'd like to use HiveQL's regexp_extract to return bl foo bar
Danny David Leybzon
  • 670
  • 1
  • 9
  • 21