Questions tagged [impala]

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Introduction from the whitepaper "Impala: A Modern, Open-Source SQL Engine for Hadoop":

INTRODUCTION

Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

...

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.

2083 questions
9
votes
1 answer

Impala command to know DB table size

Is there any way to check a table's size and other properties? I tried COMPUTE STATS, but it gives the table's details except the size. Any links to this information and other details are much appreciated.
Shantesh
  • 1,470
  • 1
  • 16
  • 26
9
votes
2 answers

Create table from CSV with values containing commas enclosed in quotes

I'm trying to create a table in Impala from a CSV that I've uploaded into an HDFS directory. The CSV contains values with commas enclosed inside quotes. Example: 1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO,…
nxl4
  • 714
  • 2
  • 8
  • 17
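A minimal Python sketch of why this kind of row is awkward for a plain comma-delimited text table: splitting on every comma breaks the quoted company name in two, while a quote-aware parser keeps it whole. The sample string reuses only the first four fields from the excerpt above (the truncated fifth field is left out), and the helper names are hypothetical. This illustrates the parsing problem only; it is not a description of how Impala reads files.

```python
import csv
import io

# First four fields of the sample row quoted in the question.
# The fifth field is truncated in the excerpt, so it is omitted here.
SAMPLE = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605\n'


def naive_split(line):
    """Split on every comma, ignoring quotes (what a plain delimiter split does)."""
    return line.rstrip("\n").split(",")


def quote_aware_split(line):
    """Parse one CSV line with the standard library's quote-aware reader."""
    return next(csv.reader(io.StringIO(line)))


naive_fields = naive_split(SAMPLE)
quoted_fields = quote_aware_split(SAMPLE)

print(len(naive_fields), naive_fields)
print(len(quoted_fields), quoted_fields)
```

One possible workaround, under the assumption that the file can be preprocessed before upload, is to rewrite it with a delimiter that never occurs in the data (a tab or a pipe, for instance) and declare that delimiter in the table definition.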
8
votes
2 answers

How do I set a variable in an Impala query using HUE?

I need to add parameters in several locations in a long query. I want to use parameters because I need to run the query multiple times with different values substituted in. This is very cumbersome because I need to replace the text in all locations…
OTM
  • 186
  • 3
  • 14
8
votes
4 answers

Will Spark SQL completely replace Apache Impala or Apache Hive?

I need to deploy a Big Data cluster on our servers, but I only know Apache Spark. Now I need to know whether Spark SQL can completely replace Apache Impala or Apache Hive. I need your help. Thanks.
Tim Koo
  • 111
  • 1
  • 4
8
votes
1 answer

Running impala cluster from portable binaries

I'm evaluating multiple big data tools. One of them is of course Impala. I would like to start an Impala cluster by manually starting processes on the cluster nodes. As I'm currently doing for Spark, H2O, Presto and Dask, I would like to grab binaries,…
jangorecki
  • 16,384
  • 4
  • 79
  • 160
7
votes
2 answers

How to find the COMPRESSION_CODEC used on a Parquet file at the time of its generation?

Usually in Impala, we use the COMPRESSION_CODEC before inserting data into a table for which the underlying files are in Parquet format. Commands used to set COMPRESSION_CODEC: set compression_codec=snappy; set compression_codec=gzip; Is it…
Gomz
  • 850
  • 7
  • 17
7
votes
2 answers

extract the date from a timestamp value variable in Impala

How can I extract the date from a timestamp value in Impala? E.g. time = 2018-04-11 16:05:19 should become 2018-04-11
Anna
  • 444
  • 1
  • 5
  • 23
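The transformation asked for here can be sketched in Python using the exact value from the question. On the Impala side, the built-in `to_date()` function is the usual candidate for this, though that should be checked against the documentation for the version in use. The function name `date_part` below is hypothetical.

```python
from datetime import datetime


def date_part(timestamp_text):
    """Return the YYYY-MM-DD part of a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return parsed.date().isoformat()


result = date_part("2018-04-11 16:05:19")
print(result)
```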
7
votes
1 answer

Difference in days between two dates in Impala

I am trying to find a date difference in Impala. I have tried a few options. My most recent is below: ABS(dayofyear(CAST(firstdate AS TIMESTAMP)-dayofyear(CAST(seconddate AS TIMESTAMP) An example of the data looks like: firstDate: 2017-11-25 …
burnsa9
  • 131
  • 2
  • 3
  • 10
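A small Python sketch of why subtracting day-of-year values is fragile: it only works when both dates fall in the same year. The first date is the one shown in the question; the second date is a hypothetical value chosen to cross a year boundary, and both function names are illustrative. In Impala, the built-in `DATEDIFF(enddate, startdate)` is likely the more direct route, though that is a suggestion rather than something confirmed by the excerpt.

```python
from datetime import date


def day_of_year_gap(a, b):
    """Difference of day-of-year numbers, mirroring the dayofyear() approach."""
    return abs(a.timetuple().tm_yday - b.timetuple().tm_yday)


def calendar_day_gap(a, b):
    """True number of calendar days between two dates."""
    return abs((a - b).days)


first = date(2017, 11, 25)  # value shown in the question
second = date(2018, 1, 5)   # hypothetical second date in the following year

print(day_of_year_gap(first, second))   # misleading once the years differ
print(calendar_day_gap(first, second))  # actual elapsed days
```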
7
votes
1 answer

Dropping multiple partitions in Impala/Hive

1- I'm trying to delete multiple partitions at once, but struggling to do it with either Impala or Hive. I tried the following query, with and without ': ALTER TABLE cz_prd_corrti_st.s1mme_transstats_info DROP IF EXISTS PARTITION…
k_mishap
  • 451
  • 2
  • 8
  • 17
7
votes
2 answers

Impala: Show tables like query

I am working with Impala and fetching the list of tables from a database using a pattern like the one below. Assume I have a database named bank, and the tables under this database are like…
Manindar
  • 999
  • 2
  • 14
  • 30
7
votes
2 answers

How to duplicate cloudera impala table with impala-shell or other means?

I see a table "test" in Impala when I do show tables; I want to make a copy of the "test" table so that it is an exact duplicate, but named "test_copy". Is there an Impala query I can execute to do this? If not, how can I do this?
Rolando
  • 58,640
  • 98
  • 266
  • 407
7
votes
4 answers

ROW_NUMBER( ) OVER in impala

I have a use case where I need to use ROW_NUMBER() over PARTITION: Something like: SELECT Column1 , Column 2 ROW_NUMBER() OVER ( PARTITION BY ACCOUNT_NUM ORDER BY FREQ, MAN, MODEL) as LEVEL FROM TEST_TABLE I need a workaround for this…
user1189851
  • 4,861
  • 15
  • 47
  • 69
7
votes
3 answers

Get sequential number of a row (rank) within a partition without using ROW_NUMBER() OVER function

I need to rank rows by partition (or group), i.e. if my source table is: NAME PRICE ---- ----- AAA 1.59 AAA 2.00 AAA 0.75 BBB 3.48 BBB 2.19 BBB 0.99 BBB 2.50 I would like to get target table: RANK NAME PRICE ---- ---- ----- 1 AAA 0.75 2 …
Andrey Dmitriev
  • 528
  • 2
  • 9
  • 27
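The ranking described above can be sketched in Python as "sort by group, then number the rows inside each group". The input rows are the ones listed in the question. The ascending-price order is an assumption inferred from the first target row (rank 1 is AAA at 0.75), and the function name is hypothetical. The same idea is what a self-join workaround in SQL has to reproduce: a row's rank is the count of rows in its group with a price less than or equal to its own.

```python
from itertools import groupby
from operator import itemgetter

# Source rows (NAME, PRICE) as listed in the question.
ROWS = [
    ("AAA", 1.59), ("AAA", 2.00), ("AAA", 0.75),
    ("BBB", 3.48), ("BBB", 2.19), ("BBB", 0.99), ("BBB", 2.50),
]


def rank_within_groups(rows):
    """Return (rank, name, price) tuples, ranked by ascending price per name."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # group key first, then price
    ranked = []
    for name, group in groupby(ordered, key=itemgetter(0)):
        # Numbering restarts at 1 for every new group.
        for rank, (_, price) in enumerate(group, start=1):
            ranked.append((rank, name, price))
    return ranked


ranked_rows = rank_within_groups(ROWS)
for row in ranked_rows:
    print(row)
```

Ties are not handled specially in this sketch: rows with equal prices simply receive consecutive numbers, as ROW_NUMBER() would assign them.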
7
votes
2 answers

Uploading CSV for Impala

I am trying to upload a CSV file to HDFS for Impala and it keeps failing. I'm not sure what is wrong, as I have followed the guide, and the CSV is also on HDFS. CREATE EXTERNAL TABLE gc_imp ( asd INT, …
LonelySoul
  • 1,212
  • 5
  • 18
  • 45
7
votes
1 answer

Impala cannot find com.mysql.jdbc.Driver

I'm trying to set up Cloudera Impala with CDH4 in pseudo distributed mode on Red Hat 5. I have Hive using JDBC to connect to a MySQL metastore, but I'm having trouble setting up Impala with JDBC. I've been following the instructions found here:…
supermaria
  • 121
  • 1
  • 5