Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data summarization (ETL)
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop Distributed File System (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language (HiveQL), which is based on SQL with additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also allows traditional map/reduce programmers the ability to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
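Such a pluggable transformation can be sketched in HiveQL as follows; the table, column names, and script name are hypothetical:

```sql
-- Hypothetical example: stream rows through a custom Python script.
-- my_parser.py reads tab-separated lines from stdin and writes
-- transformed tab-separated lines to stdout.
ADD FILE my_parser.py;

SELECT TRANSFORM (user_id, raw_event)
       USING 'python my_parser.py'
       AS (user_id, event_type, event_time)
FROM   raw_events
DISTRIBUTE BY user_id;  -- route each user's rows to the same reducer
```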

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis runs on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes, Hive query response times can be on the order of several minutes even for the smallest jobs, while larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as the fetch-only task, LLAP, and materialized views.
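One of the speed-up features mentioned above, materialized views (available in Hive 3.0 and later), can be sketched as follows; the table and column names are hypothetical:

```sql
-- Hypothetical example: pre-aggregate a large fact table once, so that
-- Hive's optimizer can rewrite matching queries to read the smaller
-- materialized view instead of rescanning the raw data.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date,
       SUM(amount) AS total_amount
FROM   sales
GROUP  BY sale_date;
```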

To summarize, while low-latency performance is not the top priority of Hive's design, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regular expressions, the ability to call Java methods, Python and shell transformations, analytic and windowing functions, connectivity to different RDBMSs using JDBC drivers, and a Kafka connector.
  • The ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe.
  • Numerous third-party extensions, for example the Brickhouse UDFs.
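As an illustration of the SerDe mechanism, here is a minimal sketch of reading a semi-structured log with RegexSerDe; the table name, columns, regex, and location are hypothetical:

```sql
-- Hypothetical example: project structure onto raw log lines at read time
-- using RegexSerDe; each capture group maps to one column, in order.
CREATE EXTERNAL TABLE access_log (
  host    STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(\\S+) \\[([^\\]]+)\\] \"([^\"]+)\".*$"
)
STORED AS TEXTFILE
LOCATION '/logs/access';
```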

How to write a good Hive question:

  1. Add a clear textual problem description.
  2. Provide the query and/or table DDL, if applicable.
  3. Provide the exception message.
  4. Provide example input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output of the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.
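For item 5, the plan is obtained by prefixing the query with EXPLAIN; the query below is a hypothetical illustration:

```sql
-- Paste the textual output of this statement into the question,
-- not a screenshot of it.
EXPLAIN
SELECT dept,
       COUNT(*) AS cnt
FROM   employees
GROUP  BY dept;
```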


21846 questions
4
votes
0 answers

Zeppelin Spark interpreter throw java.lang.NullPointerException at org.apache.zeppelin.spark.Utils.invokeMethod

I have a cluster with hadoop 2.2, spark 2.1.1, hive 2.1.1, Zeppelin 0.7.2. In a Zeppelin spark paragraph, I executed %spark 1+1. Exceptions came in the following logs. How does this happen? Any ideas? INFO [2017-06-23 06:26:40,727] ({pool-2-thread-4} …
lex
  • 41
  • 2
4
votes
1 answer

What is the most efficient way to create new Spark Tables or Data Frames in Sparklyr?

Using the sparklyr package on a Hadoop cluster (not a VM), I'm working with several types of tables that need to be joined, filtered, etc... and I'm trying to determine what would be the most efficient way to use the dplyr commands along with the…
quickreaction
  • 675
  • 5
  • 17
4
votes
2 answers

Sparklyr/Hive: how to use regex (regexp_replace) correctly?

Consider the following example dataframe_test<- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z')) # A tibble: 2 x 1 mydate 1 2011-03-01T00:00:04.226Z 2…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
4
votes
4 answers

Improving performance of hive jdbc

Does anyone know how to increase performance for a HIVE JDBC connection. Detailed problem: When I query hive from the Hive CLI, I get a response within 7 sec, but from a HIVE JDBC connection I get a response after 14 sec. I was wondering if there is any way…
techprat
  • 375
  • 7
  • 23
4
votes
2 answers

Pyhive, SASL and Python 3.5

I tried to set a hive connection as described here: How to Access Hive via Python? using hive.Connection with python 3.5.2 (installed on a cloudera Linux BDA) but the SASL package seems to cause a problem. I saw on a forum that SASL is…
Thomas Bury
  • 138
  • 1
  • 2
  • 8
4
votes
1 answer

sparklyr can't see databases created in Hive and vice versa

I installed Apache Hive in local and I was trying to read tables via Rstudio/sparklyr. I created a database using Hive: hive> CREATE DATABASE test; and I was trying to read that database using the following R…
stochazesthai
  • 617
  • 1
  • 7
  • 20
4
votes
1 answer

Hive error : Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable

I am facing an error when I try to query a table in hive when hive uses spark. For example, when I do: select count(*) from ma_table; I get this: Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable at…
Flibidi
  • 153
  • 2
  • 12
4
votes
1 answer

Hive equivalent to Spark Vector on table creation

I have a Spark DataFrame with one of the columns as Vector type. When I create a hive table on top of it, I don't know which type it is equivalent to CREATE EXTERNAL TABLE mix ( topicdist ARRAY ) STORED AS PARQUET LOCATION…
DeanLa
  • 1,871
  • 3
  • 21
  • 37
4
votes
2 answers

external hive metastore issue in EMR cluster

I am pointing my EMR cluster's hive metastore to an external MySQL RDS instance. I have created a new hive database "mydb" and I got the entry in the external MySQL DB in the hive.DBS table. hdfs://ip-10-239-1-118.ec2.internal:8020/user/hive/warehouse/mydb.db …
sam
  • 85
  • 3
  • 10
4
votes
1 answer

Memory allocation issue in writing Spark DataFrame to Hive table

I am trying to save a Spark DataFrame to a Hive table (Parquet) with .saveAsTable() in pySpark, but keep running in to memory issues like below: org.apache.hadoop.hive.ql.metadata.HiveException: parquet.hadoop.MemoryManager$1: New Memory allocation…
vk1011
  • 7,011
  • 6
  • 26
  • 42
4
votes
1 answer

Read data from Hadoop HDFS with SparkSQL connector to visualize it in Superset?

On an Ubuntu server I set up Divolte Collector to gather clickstream data from websites. The data is being stored in Hadoop HDFS (Avro files). (http://divolte.io/) Then I would like to visualize the data with Airbnb Superset which has several…
4
votes
1 answer

Hive Create table - When to use VARCHAR and STRING as column data type

I am trying to create a HIVE table. I am not sure when we use VARCHAR and when we use String. If we use VARCHAR then do we have to define length like we define in RDBMS as VARCHAR(10) Please help
v83rahul
  • 283
  • 2
  • 7
  • 20
4
votes
2 answers

Hive query select one column depending on another column during group by

There are similar questions out there, but the solution of them can't quite solve my problem. Consider the following table: id type time 1 a 1 1 a 2 1 b 3 2 b 1 2 b 2 What I want is the id with the smallest time and the type…
wwood
  • 489
  • 6
  • 19
4
votes
1 answer

How to alias a column with identifier that contains spaces?

Does anyone know the syntax for aliasing a column without underscores in Hive? In SQL and MySQL you can use single quotes or brackets. This does not seem to work in Hive. Here is a simple query that wouldn't work: select inbound_handled as 'IB…
M Jennings
  • 41
  • 1
  • 2
4
votes
1 answer

Aggregate strings in group by and ordered in Hive and Presto

I have a table in the following format: IDX IDY Time Text idx1 idy1 t1 text1 idx1 idy2 t2 text2 idx1 idy2 t3 text3 idx1 idy1 t4 text4 idx2 idy3 t5 text5 idx2 idy3 t6 text6 idx2 idy1 t7 text7 idx2 idy3 t8 text8 What I'd like to see is something like…
Nick
  • 367
  • 4
  • 6
  • 13