Questions tagged [impala]

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Introduction from the whitepaper "Impala: A Modern, Open-Source SQL Engine for Hadoop":

INTRODUCTION

Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

...

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.

2083 questions
5
votes
1 answer

What are the fundamental architectural, SQL compliance, and data use scenario differences between Presto and Impala?

Can some experts give some succinct answers to the differences between Presto and Impala from these perspectives? Fundamental architecture design SQL compliance Real-world latency Any SPOF or fault-tolerance functionality Structured and…
Yellow Duck
  • 261
  • 1
  • 4
  • 14
5
votes
1 answer

What to use.. Impala on HDFS, or Impala on Hbase or just the Hbase?

I am working on a proof-of-concept task. The task is to implement a feature of our product using Hadoop technology. The feature is quite simple: we have a UI which will let you insert details about a "Network Issue". All details about such an issue are…
Ameya
  • 147
  • 2
  • 15
5
votes
4 answers

Error connecting: Could not connect to localhost:21000

I am trying to install Cloudera Impala on my local machine (32-bit Ubuntu) without Cloudera Manager (it is not supported on 32-bit Ubuntu; I also tried and failed). I have tried the following commands to download Impala from the repository. $ sudo…
Naresh
  • 5,073
  • 12
  • 67
  • 124
5
votes
3 answers

split function does not work in Cloudera Impala

I keep getting an AnalysisException that says "split unknown" when I try to use the split function in Cloudera Impala. It seems to be a valid function listed on the built-in functions page. For reference, I'm using Hue to interact with Impala. Does…
Emre Colak
  • 814
  • 1
  • 9
  • 15
4
votes
5 answers

How to increase superset row limit and timeout cache for SQL Lab and Visualization

I have a dataset that has 1 billion rows. The data is stored in Hive. Also, I put Impala as a layer between Hive and Superset. The queries that are run in Superset have row limit max. 100.000. I need to change it with no row limit. Furthermore, I…
ufukyılmaz
  • 51
  • 1
  • 6
4
votes
2 answers

Impala add column with default value

I want to add a column to an existing impala table(and view) with a default value (so that the existing rows also have a value). The column should not allow null values. ALTER TABLE dbName.tblName ADD COLUMNS (id STRING NOT NULL '-1') I went…
user2441441
  • 1,237
  • 4
  • 24
  • 45
4
votes
1 answer

Compaction in Impala Tables

I want to learn about compaction in Impala tables but can't find any material on it. What are the different techniques, and where can I find material to study them?
4
votes
1 answer

Override underlying parquet data seamlessly for impala table

I have an Impala table backed by parquet files which is used by another team. Every day I run a batch Spark job that overwrites the existing parquet files (creating new data set, the existing files will be deleted and new files will be created) Our…
Kalaiselvam M
  • 1,050
  • 1
  • 16
  • 25
4
votes
1 answer

What is "cold start" in Hive and why doesn't Impala suffer from this?

I'm reading the literature on comparing Hive and Impala. Several sources state some version of the following "cold start" line: It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every…
DivyaJyoti Rajdev
  • 734
  • 1
  • 9
  • 15
4
votes
0 answers

Connect to impala using python from Windows machine. Error: 'TSocket' object has no attribute 'isOpen'

I want to access impala using python 3.7.3 (Anaconda, Jupyter Notebook) on my Windows machine. The following code I am trying to execute: from impala.dbapi import connect import traceback try: conn = connect(host='myhost.xx.yy', port=21050,…
clex
  • 465
  • 2
  • 7
  • 19
4
votes
1 answer

Consistent Hive and Impala Hash?

I am looking for a consistent way to hash something in both the Hive Query Language and the Impala Query Language where the hashing function produce the same value regardless of if it is done in Hive or in Impala. To clarify, I want something like…
Aur
  • 215
  • 2
  • 10
4
votes
0 answers

Select all except one impala

I am looking for an approach to exclude a column from an inner SELECT query in Impala. I was able to figure it out in Hive. Has anyone tried it in Impala? Hive: select `(col_name)?+.+` from t1; -- to exclude a column in Hive. Impala: I…
Govind
  • 419
  • 8
  • 25
4
votes
2 answers

Cannot ALTER or DROP big Impala partitioned tables - CAUSED BY: MetaException: Timeout when executing

I have several partitioned Impala tables with more than 50k partitions. They work well except for Hive Metastore operations such as DROP and ALTER ... RENAME, where I get this error message: Query: drop table cars ERROR: ImpalaRuntimeException:…
Mohammed Acharki
  • 234
  • 2
  • 14
4
votes
2 answers

Query to Show only column names in impala

In Hive we can run "show columns in TABLE_NAME" to get only the column names of a table. I want a query that shows only the column names of a table in Impala. How can I get only the column names of a table in Impala?
Biswa Patra
  • 59
  • 1
  • 1
  • 8
4
votes
1 answer

(Hive, SQL) - How to sort a list of string inside a column?

I have a big data problem in Hive (SQL). SELECT genre, COUNT(*) AS unique_count FROM table_name GROUP BY genre which gives result like: genre | unique_count ---------------------------------- Romance,Crime,Drama,Law |…
Afloz
  • 3,625
  • 3
  • 25
  • 31