Questions tagged [hive]

Apache Hive is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a data warehouse system built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language, Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by its built-in capabilities.
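
For illustration, a minimal HiveQL sketch of plugging in a custom script with TRANSFORM (the parse_logs.py script and the logs table are hypothetical):

    ADD FILE parse_logs.py;                 -- ship the script to the cluster nodes
    SELECT TRANSFORM (ip, url)              -- input rows are streamed to the script on stdin
           USING 'python parse_logs.py'
           AS (ip STRING, domain STRING)    -- columns read back from the script's stdout
    FROM logs;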

Since Hive is Hadoop-based, it does not and cannot promise low query latencies. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time querying. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of a few minutes or less, Hive response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as the fetch-only task, LLAP, and materialized views.

To summarize, while low-latency performance is not a top priority of Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose coupling with its input formats
  • A rather rich query language, with native support for JSON, XML, and regular expressions, the ability to call Java methods, Python and shell transformations, analytic and windowing functions, connectivity to different RDBMSs via JDBC drivers, and a Kafka connector (see the sketch after this list)
  • The ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe
  • Numerous third-party extensions, for example the Brickhouse UDFs
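
For example, a minimal HiveQL sketch of the native JSON support and windowing functions mentioned above (the events table and its columns are hypothetical):

    SELECT user_id,
           get_json_object(payload, '$.page')                        AS page,  -- native JSON extraction
           row_number() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn     -- windowing function
    FROM events;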

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output (see the sketch after this list).
  5. Questions about query performance should include the EXPLAIN output.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
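
A sketch of what a well-formed question might contain (all table and column names here are hypothetical):

    -- DDL of the table involved
    CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (dt STRING) STORED AS ORC;

    -- the query that misbehaves or is slow
    SELECT dt, sum(amount) FROM sales GROUP BY dt;

    -- for performance questions, include the plan
    EXPLAIN SELECT dt, sum(amount) FROM sales GROUP BY dt;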

Official links:

  • Apache Hive: https://hive.apache.org/
  • Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

21846 questions
25 votes, 4 answers

Hive unable to manually set number of reducers

I have the following hive query: select count(distinct id) as total from mytable; which automatically spawns: 1408 Mappers 1 Reducer I need to manually set the number of reducers and I have tried the following: set mapred.reduce.tasks=50 set…
magicalo
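
A commonly suggested workaround, sketched against the asker's table: count(distinct ...) forces the final aggregation into a single reducer, while the subquery form lets the distinct stage use many reducers (mapreduce.job.reduces is the newer name of mapred.reduce.tasks):

    set mapreduce.job.reduces=50;
    SELECT count(*) AS total
    FROM (SELECT DISTINCT id FROM mytable) t;
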
25 votes, 2 answers

Unable to use an existing Hive permanent UDF from Spark SQL

I have previously registered a UDF with hive. It is permanent not TEMPORARY. It works in beeline. CREATE FUNCTION normaliseURL AS 'com.example.hive.udfs.NormaliseURL' USING JAR 'hdfs://udfs/hive-udfs.jar'; I have spark configured to use the hive…
Rob Cowie
25 votes, 6 answers

What is the difference between -hivevar and -hiveconf?

From hive -h : --hiveconf Use value for given property --hivevar Variable subsitution to apply to hive commands. e.g. --hivevar A=B
Karnimrod
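
In short, --hivevar defines a substitution variable for the statement text, while --hiveconf sets a configuration property; each lives in its own namespace. A small sketch (the variable and property values are only examples):

    -- launched e.g. as: hive --hivevar tab=mytable --hiveconf hive.execution.engine=tez
    SELECT * FROM ${hivevar:tab};     -- variable substitution in the query text
    SET hive.execution.engine;        -- prints the value supplied via --hiveconf
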
25 votes, 1 answer

How to find out if a Hive table is external or internal?

I have multiple questions here. I am looking for any hive shell commands or queries to find the below details. Given a hive database name, how can I get the list of external tables in that database? Given a hive table name, how can I find out…
Sankar
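
A short sketch of the usual check (the table name is a placeholder): DESCRIBE FORMATTED prints a Table Type field that reads MANAGED_TABLE or EXTERNAL_TABLE.

    DESCRIBE FORMATTED mydb.mytable;   -- look for "Table Type:" in the output
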
25 votes, 8 answers

How to know location about partition in hive?

If I write a hive sql like ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023; How can I query this location about partition later? Because I found there is some data in location but I can't query them,…
MoreFreeze
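
A sketch of how the partition's location can be inspected later, reusing the names from the excerpt:

    SHOW PARTITIONS tbl_name;                              -- partitions registered in the metastore
    DESCRIBE FORMATTED tbl_name PARTITION (dt=20131023);   -- the Location field shows where the data lives
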
25 votes, 3 answers

how to get input file name as column within hive query

I have a hive external tables that mapped to some directory. This directory includes a several files. I want to run query like find file name where there is a user "abc" select file_name , usr from usrs_tables where usr = "abc" But of course the…
Julias
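
Hive exposes the source file path through the INPUT__FILE__NAME virtual column; a sketch using the names from the excerpt:

    SELECT INPUT__FILE__NAME AS file_name, usr
    FROM usrs_tables
    WHERE usr = 'abc';
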
25 votes, 3 answers

How to delete/truncate tables from Hadoop-Hive?

Please tell me how to delete table from hive and also tell from where i can get more information about hive queries.
Amol Fasale
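
A minimal sketch of the two statements usually given as the answer (mytable is a placeholder):

    DROP TABLE IF EXISTS mytable;   -- removes the table; for managed tables the data is deleted too
    TRUNCATE TABLE mytable;         -- keeps the table definition but deletes the data (managed tables only)
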
25 votes, 4 answers

Hive: parsing JSON

I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this? Here is an example: {"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}} I need these values out…
Don P
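
A sketch of the usual built-in approach, get_json_object with JSONPath-style expressions (the table and column names are hypothetical):

    SELECT get_json_object(json_col, '$.data.ad.impressions.s') AS s,
           get_json_object(json_col, '$.data.ad.impressions.o') AS o
    FROM mytable;
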
24 votes, 6 answers

Does Hive have something equivalent to DUAL?

I'd like to run statements like SELECT date_add('2008-12-31', 1) FROM DUAL Does Hive (running on Amazon EMR) have something similar?
jbreed
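
Recent Hive versions accept a SELECT without a FROM clause, so no DUAL table is needed; a one-line sketch:

    SELECT date_add('2008-12-31', 1);   -- no FROM clause required
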
24 votes, 7 answers

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table: SomeData_df.write.mode('overwrite').saveAsTable("SomeData") I get the following error: "Can not create the managed table('SomeData').…
24 votes, 4 answers

Why is Spark saveAsTable with bucketBy creating thousands of files?

Context Spark 2.0.1, spark-submit in cluster mode. I am reading a parquet file from hdfs: val spark = SparkSession.builder .appName("myApp") .config("hive.metastore.uris", "thrift://XXX.XXX.net:9083") …
astro_asz
24 votes, 5 answers

Export as csv in beeline hive

I am trying to export my hive table as a csv in beeline hive. When I run the command !sql select * from database1 > /user/bob/output.csv it gives me syntax error. I have successfully connected to the database at this point using the below command.…
Defcon
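
One HiveQL-only alternative, sketched with placeholder names: write delimited files to a directory instead of redirecting beeline output.

    INSERT OVERWRITE DIRECTORY '/user/bob/output'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM database1.mytable;
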
24 votes, 6 answers

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

I have saved a remote DB table in Hive using saveAsTable method, now when i try to access the Hive table data using CLI command select * from table_name, It's giving me the error below: 2016-06-15 10:49:36,866 WARN [HiveServer2-Handler-Pool:…
serverliving.com
24 votes, 1 answer

Hive check elements in array

I have an Hive table made of user_id and item_id (id of items that have been purchased by the user). I want to get a list of all the users who purchased item 1 but not item 2 and 3. To do this I wrote the simple query: SELECT user_id,…
lucacerone
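
One common way to express "bought item 1 but not items 2 or 3", sketched against a hypothetical purchases(user_id, item_id) table:

    SELECT user_id
    FROM   purchases
    GROUP  BY user_id
    HAVING max(CASE WHEN item_id = 1 THEN 1 ELSE 0 END) = 1
       AND max(CASE WHEN item_id IN (2, 3) THEN 1 ELSE 0 END) = 0;
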
24 votes, 3 answers

Connect from Java to Hive using JDBC

I'm trying to connect from Java to Hive server 1. I found a question time ago in this forum but it doesn't work for me. I'm using this code: import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import…
adrian