Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in a Hadoop-compatible distributed file system. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • A query language, Hive Query Language (HiveQL), based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, which enables users familiar with SQL to query this data.

At the same time, the language also lets traditional map/reduce programmers plug in their custom mappers and reducers to do more sophisticated analysis than the built-in capabilities of the language support.
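As a sketch of how such a custom script can be plugged in (the table, columns, and script name here are hypothetical):

```sql
-- Distribute rows by key, then stream them through a user-supplied
-- Python script registered with ADD FILE (my_reducer.py is hypothetical).
ADD FILE my_reducer.py;

SELECT TRANSFORM (user_id, clicks)
       USING 'python my_reducer.py'
       AS (user_id STRING, total_clicks INT)
FROM (
  SELECT user_id, clicks
  FROM page_views
  DISTRIBUTE BY user_id
  SORT BY user_id
) t;
```

DISTRIBUTE BY guarantees that all rows for a given key reach the same reducer instance of the script, which is what makes per-key custom aggregation possible.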

Since Hive is Hadoop-based, it does not and cannot promise low latency on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis runs on a significantly smaller amount of data and proceeds much more iteratively with response times between iterations of a few minutes or less, Hive query response times can be on the order of several minutes even for the smallest jobs, while larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions; the ability to call Java methods; Python and shell transformations; analytic and windowing functions; and connectivity to different RDBMSs via JDBC drivers and a Kafka connector.
  • The ability to read and write almost any file format using native and third-party SerDes (e.g., RegexSerDe).
  • Numerous third-party extensions, for example the Brickhouse UDF collection.
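As an illustration of the SerDe-based extensibility, a sketch of parsing raw log lines with the built-in RegexSerDe (table name, columns, regex, and path are hypothetical):

```sql
-- Parse Apache-style log lines with RegexSerDe instead of writing
-- a custom InputFormat; one capture group per declared column.
CREATE EXTERNAL TABLE access_log (
  host    STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) \\[([^\\]]+)\\] \"([^\"]+)\".*"
)
STORED AS TEXTFILE
LOCATION '/data/logs/access';
```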

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide input and desired output data examples.
  5. For questions about query performance, include EXPLAIN output.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.


21846 questions
28 votes, 2 answers

How to get array/bag of elements from Hive group by operator?

I want to group by a given field and get the output with grouped fields. Below is an example of what I am trying to achieve:- Imagine a table named 'sample_table' with two columns as below:- F1 F2 001 111 001 222 001 123 002 222 002 333 003 555 I…
asked by Anuroop
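For questions like the one above, a common HiveQL approach (a sketch; table and column names taken from the question excerpt) is to gather the grouped values with collect_list or collect_set:

```sql
-- Group by F1 and gather all F2 values into an array per group,
-- e.g. F1 = 001 yields the array of its three F2 values.
SELECT F1, collect_list(F2) AS f2_values
FROM sample_table
GROUP BY F1;
```

collect_set does the same but removes duplicates within each group.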
28 votes, 9 answers

Writing to HDFS could only be replicated to 0 nodes instead of minReplication (=1)

I have 3 data nodes running, and while running a job I am getting the error below: java.io.IOException: File /user/ashsshar/olhcache/loaderMap9b663bd9 could only be replicated to 0 nodes instead of minReplication (=1). There are 3…
asked by Ashish Sharma
27 votes, 3 answers

How do I copy files from S3 to Amazon EMR HDFS?

I'm running hive over EMR, and need to copy some files to all EMR instances. One way as I understand is just to copy files to the local file system on each node the other is to copy the files to the HDFS however I haven't found a simple way to…
asked by Tomer
27 votes, 5 answers

How to convert unix epoch time to date string in Hive

I have a log file which contains timestamp column. The timestamp is in unix epoch time format. I want to create a partition based on a timestamp with partitions year, month and day. So far I have done this but it is throwing an error. PARSE ERROR…
asked by priyank
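For epoch-to-date conversion, Hive's from_unixtime is the usual tool (a sketch; the table and column names are hypothetical):

```sql
-- from_unixtime converts epoch seconds to a formatted date string;
-- year/month/day can then be derived for partitioning.
SELECT from_unixtime(epoch_ts, 'yyyy-MM-dd') AS dt,
       year(from_unixtime(epoch_ts))  AS yr,
       month(from_unixtime(epoch_ts)) AS mon,
       day(from_unixtime(epoch_ts))   AS dd
FROM log_table;
```

Note that from_unixtime expects seconds; millisecond timestamps need to be divided by 1000 first.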
27 votes, 3 answers

Overwrite only some partitions in a partitioned spark Dataset

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week daily job, and only overwriting last week of data. Default Spark behaviour is to overwrite the whole table, even if only…
asked by Madhava Carrillo
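On the Hive side, the analogous behaviour is a dynamic-partition INSERT OVERWRITE, which replaces only the partitions produced by the SELECT rather than the whole table (a sketch; table names and the date filter are hypothetical). Spark offers a similar effect via spark.sql.sources.partitionOverwriteMode=dynamic (Spark 2.3+):

```sql
-- With dynamic partitioning enabled, INSERT OVERWRITE touches only
-- the partitions that appear in the SELECT output.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE daily_metrics PARTITION (ds)
SELECT metric, value, ds
FROM staging_metrics
WHERE ds >= '2019-01-01';
```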
27 votes, 6 answers

Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

Neither the developer nor the API documentation includes any reference about what options can be passed in DataFrame.saveAsTable or DataFrameWriter.options and they would affect the saving of a Hive table. My hope is that in the answers to this…
asked by Sim
27 votes, 5 answers

Books to start learning big data

I would like to start learning about the big data technologies. I want to work in this area in the future. Does anyone know good books to start learning about it? Hadoop, HBase. Beginner - intermediate - advanced - Thanks in advance
asked by Gunter Amorim
26 votes, 6 answers

Hive getting top n records in group by query

I have following table in hive user-id, user-name, user-address,clicks,impressions,page-id,page-name I need to find out top 5 users[user-id,user-name,user-address] by clicks for each page [page-id,page-name] I understand that we need to first group…
asked by TopCoder
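Top-n-per-group questions like this are typically answered with a windowing function (a sketch; the table name user_clicks is hypothetical, columns taken from the question):

```sql
-- Rank users by clicks within each page, then keep the top 5 per page.
SELECT page_id, page_name, user_id, user_name, user_address, clicks
FROM (
  SELECT page_id, page_name, user_id, user_name, user_address, clicks,
         row_number() OVER (PARTITION BY page_id ORDER BY clicks DESC) AS rn
  FROM user_clicks
) t
WHERE rn <= 5;
```

row_number() requires Hive 0.11 or later; rank() or dense_rank() can be substituted if ties should share a position.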
26 votes, 8 answers

Is there a Hive equivalent of SQL "not like"

While Hive supports positive like queries: ex. select * from table_name where column_name like 'root~%'; Hive does not support negative like queries: ex. select * from table_name where column_name not like 'root~%'; Does anyone know an…
asked by CMaury
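For reference, recent Hive versions accept NOT LIKE directly, and the explicit negation form works on older versions as well (a sketch reusing the names from the question):

```sql
-- Modern Hive: NOT LIKE is supported directly.
SELECT * FROM table_name WHERE column_name NOT LIKE 'root~%';

-- Older Hive: wrap the positive LIKE in NOT (...).
SELECT * FROM table_name WHERE NOT (column_name LIKE 'root~%');
```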
26 votes, 3 answers

Showing tables from specific database with Pyspark and Hive

Having some databases and tables in them in Hive instance. I'd like to show tables for some specific database (let's say 3_db). +------------------+--+ |  database_name   | +------------------+--+ | 1_db            | | 2_db            | | 3_db  …
asked by Keithx
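HiveQL can scope SHOW TABLES to a database directly; from PySpark the same statement can be issued with spark.sql("SHOW TABLES IN ...") or spark.catalog.listTables(...). A sketch using the database name from the question:

```sql
-- Identifiers starting with a digit need backquotes in HiveQL.
SHOW TABLES IN `3_db`;

-- Equivalent two-step form:
USE `3_db`;
SHOW TABLES;
```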
26 votes, 2 answers

How can I convert array to string in hive sql?

I want to convert an array to string in hive. I want to collect_set array values to convert to string without [[""]]. select actor, collect_set(date) as grpdate from actor_table group by actor; so that [["2016-07-01", "2016-07-02"]] would become…
asked by Bethlee
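concat_ws joins an array into a single delimiter-separated string, which is the usual fix for unwanted ["..."] output (a sketch reusing the query from the excerpt; this assumes the date column is stored as a string):

```sql
-- collect_set builds the array, concat_ws flattens it to
-- a comma-separated string such as 2016-07-01,2016-07-02.
SELECT actor,
       concat_ws(',', collect_set(`date`)) AS grpdate
FROM actor_table
GROUP BY actor;
```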
26 votes, 3 answers

How to control partition size in Spark SQL

I have a requirement to load data from an Hive table using Spark SQL HiveContext and load into HDFS. By default, the DataFrame from SQL output is having 2 partitions. To get more parallelism i need more partitions out of the SQL. There is no…
asked by nagendra
26 votes, 4 answers

Can we load Parquet file into Hive directly?

I know we can load parquet file using Spark SQL and using Impala but wondering if we can do the same using Hive. I have been reading many articles but I am still confused. Simply put, I have a parquet file - say users.parquet. Now I am stuck here…
asked by annunarcist
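Hive (0.13+) can read Parquet natively by pointing an external table at the directory containing the file (a sketch; the columns and HDFS path are hypothetical and must match the actual Parquet schema):

```sql
-- External table over the directory holding users.parquet;
-- no data is copied, Hive just reads the files in place.
CREATE EXTERNAL TABLE users (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
LOCATION '/data/users';
```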
26 votes, 4 answers

How to copy all Hive tables from one database to another

I have default db in hive table which contains 80 tables . I have created one more database and I want to copy all the tables from default DB to new Databases. Is there any way I can copy from One DB to Other DB, without creating individual…
asked by Amaresh
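Per table, the copy is a two-statement pattern in HiveQL (a sketch for one hypothetical table t1; looping over the output of SHOW TABLES requires a shell or scripting step outside Hive):

```sql
-- Recreate the schema in the target database, then copy the rows.
CREATE TABLE new_db.t1 LIKE default.t1;
INSERT INTO TABLE new_db.t1 SELECT * FROM default.t1;
```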
26 votes, 5 answers

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL. Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of…
asked by Niels Basjes