Questions tagged [impala]

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Introduction from the whitepaper Impala: A Modern, Open-Source SQL Engine for Hadoop:

INTRODUCTION

Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

...

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.


2083 questions
5
votes
0 answers

How to use Impala to read Hive view containing complex types?

I have some data that is processed and modeled based on case classes, and the classes can also contain other case classes, so the final table has complex types (struct, array). Using the case class I save the data in Hive using…
Shikkou
  • 545
  • 7
  • 22
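Impala can query complex types in Parquet tables (2.3 and later) using dot notation for STRUCT fields and join-style references for ARRAY columns. A minimal sketch of that syntax, assuming a hypothetical Parquet table users:

    -- hypothetical table: users(id BIGINT, address STRUCT<city:STRING, zip:STRING>, tags ARRAY<STRING>)
    SELECT u.id,
           u.address.city,     -- STRUCT fields are referenced with dot notation
           t.item AS tag       -- ITEM is the pseudocolumn for each array element
    FROM users u, u.tags t;    -- the ARRAY column is referenced like a joined table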
5
votes
2 answers

Hive/Impala performance with string partition key vs Integer partition key

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?
5
votes
1 answer

Select a specific range of rows in Impala

Let's say I need to select rows 3 to 10. In MySQL I would use LIMIT; in Oracle I would use ROWNUM. Does Impala allow selecting a specific range of rows with a similarly simple method? Thanks in advance.
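Impala accepts an OFFSET clause together with LIMIT, but only in queries that have an ORDER BY, so "rows 3 to 10" has to be defined relative to some ordering. A minimal sketch, assuming a hypothetical table t with a sortable id column:

    -- rows 3 through 10 of t when ordered by id: skip 2 rows, return the next 8
    SELECT *
    FROM t
    ORDER BY id
    LIMIT 8 OFFSET 2;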
5
votes
0 answers

select all but a few columns in impala

Is there a way to replicate the below in Impala?
SET hive.support.quoted.identifiers=none
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal')
SELECT `(A)?+.+` FROM MyTxtTable WHERE A='SumVal'
Basically I have a table in…
redhands
  • 349
  • 5
  • 14
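Impala has no equivalent of Hive's regex column selection (the hive.support.quoted.identifiers trick above), so the usual workaround is to name the wanted columns explicitly. A sketch of that approach, with hypothetical non-partition columns col1 through col3:

    -- hypothetical columns; Impala has no `(A)?+.+` selector, so list everything except A and B
    INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal')
    SELECT col1, col2, col3
    FROM MyTxtTable
    WHERE A = 'SumVal';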
5
votes
4 answers

How to load Impala table directly to Spark using JDBC?

I am trying to write a Spark job in Python that would open a JDBC connection to Impala and load a VIEW directly from Impala into a DataFrame. This question is pretty close but in Scala: Calling JDBC to impala/hive from within a spark job and…
alfredox
  • 4,082
  • 6
  • 21
  • 29
5
votes
3 answers

Invalidate metadata/refresh Impala from Spark code

I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table. Currently this invalidation is done after my Spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my…
Havnar
  • 2,558
  • 7
  • 33
  • 62
5
votes
3 answers

Multiple Full Outer Joins

I want to use the result of a FULL OUTER JOIN as a table to FULL OUTER JOIN on another table. What is the syntax that I should be using? For example, T1, T2, T3 are my tables with columns id, name. I need something like: T1 FULL OUTER JOIN T2 on…
Putt
  • 299
  • 4
  • 10
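FULL OUTER JOINs can be chained directly in Impala, but since either side of the first join can produce NULL ids, the second join condition usually needs COALESCE. A sketch using the question's tables T1, T2, T3 with columns id and name:

    SELECT COALESCE(t1.id, t2.id, t3.id) AS id,
           t1.name AS name1,
           t2.name AS name2,
           t3.name AS name3
    FROM T1 t1
    FULL OUTER JOIN T2 t2 ON t1.id = t2.id
    FULL OUTER JOIN T3 t3 ON COALESCE(t1.id, t2.id) = t3.id;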
5
votes
2 answers

[Simba][ImpalaJDBCDriver](500051) ERROR processing query/statement

I'm getting the following error while executing queries against a database in Impala. With other databases it's working fine. The error trace is as follows: [Simba][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: select * from…
smali
  • 4,687
  • 7
  • 38
  • 60
5
votes
3 answers

Cloudera Impala INVALIDATE METADATA

As discussed in the Impala tutorials, Impala uses a Metastore shared with Hive, but it has been mentioned that if you create or modify tables using Hive, you should execute the INVALIDATE METADATA or REFRESH command to inform Impala about…
masoumeh
  • 468
  • 1
  • 5
  • 15
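For reference, the two statements differ in scope and cost: REFRESH reloads the file and block metadata for one table and is the cheaper choice after new data files are added, while INVALIDATE METADATA discards the cached metadata entirely and is needed after tables are created or altered through Hive. A minimal sketch with a hypothetical table mydb.sales:

    -- after new data files are loaded into an existing table via Hive or HDFS
    REFRESH mydb.sales;

    -- after creating, dropping, or altering tables through Hive
    INVALIDATE METADATA mydb.sales;
    -- or, to discard cached metadata for every table (more expensive):
    INVALIDATE METADATA;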
5
votes
5 answers

What is the best query to sample from Impala for a huge database?

I have a huge table (more than 1 billion rows) in Impala. I need to sample ~100,000 rows several times. What is the best way to query sample rows?
Soroosh
  • 477
  • 2
  • 7
  • 18
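One common pattern is to filter on rand() so the table is scanned once and roughly the requested fraction of rows is kept, instead of sorting a billion rows; newer Impala releases (2.11 and later) also offer a TABLESAMPLE clause. A sketch, assuming a hypothetical table big_tbl of about 1 billion rows:

    -- keeps roughly 1 row in 10,000 (~100,000 of 1 billion); LIMIT caps the result size
    SELECT *
    FROM big_tbl
    WHERE rand() < 0.0001
    LIMIT 100000;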
5
votes
1 answer

Does Impala make effective use of buckets in a Hive bucketed table?

I'm in the process of improving the performance of a table. Say this table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
STORED AS…
5
votes
1 answer

install cloudera impala shell on mac os x and connect to impala cluster

We have an Impala server in prod and I need to connect to it with impala-shell from my local MacBook with Mac OS X (10.8). I downloaded Impala-cdh5.1.0-release.tar.gz, unarchived it, and tried buildall.sh, which failed: .../bin/impala-config.sh: line 123: nproc:…
yetanothercoder
  • 1,689
  • 4
  • 21
  • 43
5
votes
2 answers

Impala - file not found error

I'm using Impala with Flume as a file stream. The problem is that Flume adds temporary files with the extension .tmp, and then when they are deleted Impala queries fail with the following message: Backend 0: Failed to open HDFS file …
griffon vulture
  • 6,594
  • 6
  • 36
  • 57
5
votes
3 answers

Loading a large CSV into Hadoop via Hue only stores a 64MB block

I'm using the Cloudera quickstart VM 5.1.0-1. I'm trying to load my 3GB CSV into Hadoop via Hue, and what I have tried so far is:
- Load the CSV into HDFS, specifically into a folder called datasets at /user/hive/datasets
- Use the Metastore…
bobo32
  • 992
  • 2
  • 9
  • 21
5
votes
2 answers

Impala on Hadoop 2.2.0 without CDH?

I want to test and configure Impala with my Hadoop 2.2.0 distribution, not the Cloudera one. I want to know if it's possible to use Impala without CDH, because I have only read that Impala is CDH-dependent. I'm trying to follow the guide in the Impala GitHub -…
BAndrade
  • 107
  • 1
  • 8