Questions tagged [hive]

Apache Hive is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a data warehouse system built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language, Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by its built-in capabilities.
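
For illustration, a minimal HiveQL sketch of plugging in a custom script with TRANSFORM (the parse_logs.py script and the logs table are hypothetical):

    ADD FILE parse_logs.py;                 -- ship the script to the cluster nodes
    SELECT TRANSFORM (ip, url)              -- input rows are streamed to the script on stdin
           USING 'python parse_logs.py'
           AS (ip STRING, domain STRING)    -- columns read back from the script's stdout
    FROM logs;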

Since Hive is Hadoop-based, it does not and cannot promise low query latencies. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time querying. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of a few minutes or less, Hive response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as the fetch-only task, LLAP, and materialized views.

To summarize, while low-latency performance is not a top priority of Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose coupling with its input formats
  • A rather rich query language, with native support for JSON, XML, and regular expressions, the ability to call Java methods, Python and shell transformations, analytic and windowing functions, connectivity to different RDBMSs via JDBC drivers, and a Kafka connector (see the sketch after this list)
  • The ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe
  • Numerous third-party extensions, for example the Brickhouse UDFs
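
For example, a minimal HiveQL sketch of the native JSON support and windowing functions mentioned above (the events table and its columns are hypothetical):

    SELECT user_id,
           get_json_object(payload, '$.page')                        AS page,  -- native JSON extraction
           row_number() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn     -- windowing function
    FROM events;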

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output (see the sketch after this list).
  5. Questions about query performance should include the EXPLAIN output.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
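
A sketch of what a well-formed question might contain (all table and column names here are hypothetical):

    -- DDL of the table involved
    CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (dt STRING) STORED AS ORC;

    -- the query that misbehaves or is slow
    SELECT dt, sum(amount) FROM sales GROUP BY dt;

    -- for performance questions, include the plan
    EXPLAIN SELECT dt, sum(amount) FROM sales GROUP BY dt;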

Official links:

  • Apache Hive: https://hive.apache.org/
  • Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

21846 questions
25 votes, 4 answers

Hive unable to manually set number of reducers

I have the following hive query: select count(distinct id) as total from mytable; which automatically spawns: 1408 Mappers 1 Reducer I need to manually set the number of reducers and I have tried the following: set mapred.reduce.tasks=50 set…
magicalo
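
A commonly suggested workaround, sketched against the asker's table: count(distinct ...) forces the final aggregation into a single reducer, while the subquery form lets the distinct stage use many reducers (mapreduce.job.reduces is the newer name of mapred.reduce.tasks):

    set mapreduce.job.reduces=50;
    SELECT count(*) AS total
    FROM (SELECT DISTINCT id FROM mytable) t;
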
25 votes, 2 answers

Unable to use an existing Hive permanent UDF from Spark SQL

I have previously registered a UDF with hive. It is permanent not TEMPORARY. It works in beeline. CREATE FUNCTION normaliseURL AS 'com.example.hive.udfs.NormaliseURL' USING JAR 'hdfs://udfs/hive-udfs.jar'; I have spark configured to use the hive…
Rob Cowie
25 votes, 6 answers

What is the difference between -hivevar and -hiveconf?

From hive -h : --hiveconf Use value for given property --hivevar Variable subsitution to apply to hive commands. e.g. --hivevar A=B
Karnimrod
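
In short, --hivevar defines a substitution variable for the statement text, while --hiveconf sets a configuration property; each lives in its own namespace. A small sketch (the variable and property values are only examples):

    -- launched e.g. as: hive --hivevar tab=mytable --hiveconf hive.execution.engine=tez
    SELECT * FROM ${hivevar:tab};     -- variable substitution in the query text
    SET hive.execution.engine;        -- prints the value supplied via --hiveconf
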
25 votes, 1 answer

How to find out if a Hive table is external or internal?

I have multiple questions here. I am looking for any hive shell commands or queries to find the below details. Given a hive database name, how can I get the list of external tables in that database? Given a hive table name, how can I find out…
Sankar
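
A short sketch of the usual check (the table name is a placeholder): DESCRIBE FORMATTED prints a Table Type field that reads MANAGED_TABLE or EXTERNAL_TABLE.

    DESCRIBE FORMATTED mydb.mytable;   -- look for "Table Type:" in the output
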
25 votes, 8 answers

How to know location about partition in hive?

If I write a hive sql like ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023; How can I query this location about partition later? Because I found there is some data in location but I can't query them,…
MoreFreeze
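
A sketch of how the partition's location can be inspected later, reusing the names from the excerpt:

    SHOW PARTITIONS tbl_name;                              -- partitions registered in the metastore
    DESCRIBE FORMATTED tbl_name PARTITION (dt=20131023);   -- the Location field shows where the data lives
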
25 votes, 3 answers

how to get input file name as column within hive query

I have a hive external tables that mapped to some directory. This directory includes a several files. I want to run query like find file name where there is a user "abc" select file_name , usr from usrs_tables where usr = "abc" But of course the…
Julias
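
Hive exposes the source file path through the INPUT__FILE__NAME virtual column; a sketch using the names from the excerpt:

    SELECT INPUT__FILE__NAME AS file_name, usr
    FROM usrs_tables
    WHERE usr = 'abc';
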
25 votes, 3 answers

How to delete/truncate tables from Hadoop-Hive?

Please tell me how to delete table from hive and also tell from where i can get more information about hive queries.
Amol Fasale
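
A minimal sketch of the two statements usually given as the answer (mytable is a placeholder):

    DROP TABLE IF EXISTS mytable;   -- removes the table; for managed tables the data is deleted too
    TRUNCATE TABLE mytable;         -- keeps the table definition but deletes the data (managed tables only)
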
25 votes, 4 answers

Hive: parsing JSON

I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this? Here is an example: {"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}} I need these values out…
Don P
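
A sketch of the usual built-in approach, get_json_object with JSONPath-style expressions (the table and column names are hypothetical):

    SELECT get_json_object(json_col, '$.data.ad.impressions.s') AS s,
           get_json_object(json_col, '$.data.ad.impressions.o') AS o
    FROM mytable;
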
24 votes, 6 answers

Does Hive have something equivalent to DUAL?

I'd like to run statements like SELECT date_add('2008-12-31', 1) FROM DUAL Does Hive (running on Amazon EMR) have something similar?
jbreed
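
Recent Hive versions accept a SELECT without a FROM clause, so no DUAL table is needed; a one-line sketch:

    SELECT date_add('2008-12-31', 1);   -- no FROM clause required
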
24 votes, 7 answers

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table: SomeData_df.write.mode('overwrite').saveAsTable("SomeData") I get the following error: "Can not create the managed table('SomeData').…
24 votes, 4 answers

Why is Spark saveAsTable with bucketBy creating thousands of files?

Context Spark 2.0.1, spark-submit in cluster mode. I am reading a parquet file from hdfs: val spark = SparkSession.builder .appName("myApp") .config("hive.metastore.uris", "thrift://XXX.XXX.net:9083") …
astro_asz
24 votes, 5 answers

Export as csv in beeline hive

I am trying to export my hive table as a csv in beeline hive. When I run the command !sql select * from database1 > /user/bob/output.csv it gives me syntax error. I have successfully connected to the database at this point using the below command.…
Defcon
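
One HiveQL-only alternative, sketched with placeholder names: write delimited files to a directory instead of redirecting beeline output.

    INSERT OVERWRITE DIRECTORY '/user/bob/output'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM database1.mytable;
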
24 votes, 6 answers

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

I have saved a remote DB table in Hive using saveAsTable method, now when i try to access the Hive table data using CLI command select * from table_name, It's giving me the error below: 2016-06-15 10:49:36,866 WARN [HiveServer2-Handler-Pool:…
serverliving.com
24 votes, 1 answer

Hive check elements in array

I have an Hive table made of user_id and item_id (id of items that have been purchased by the user). I want to get a list of all the users who purchased item 1 but not item 2 and 3. To do this I wrote the simple query: SELECT user_id,…
lucacerone
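
One common way to express "bought item 1 but not items 2 or 3", sketched against a hypothetical purchases(user_id, item_id) table:

    SELECT user_id
    FROM   purchases
    GROUP  BY user_id
    HAVING max(CASE WHEN item_id = 1 THEN 1 ELSE 0 END) = 1
       AND max(CASE WHEN item_id IN (2, 3) THEN 1 ELSE 0 END) = 0;
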
24 votes, 3 answers

Connect from Java to Hive using JDBC

I'm trying to connect from Java to Hive server 1. I found a question time ago in this forum but it doesn't work for me. I'm using this code: import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import…
adrian