Questions tagged [impala]

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Introduction from the whitepaper Impala: A Modern, Open-Source SQL Engine for Hadoop:

INTRODUCTION

Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

...

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.

2083 questions
0
votes
1 answer

Impala - First date of Month from string date

I have a column data_as_of_daily_date (data type String) in my source table (staging), and I need to find the first date of the month based on that source column in Impala and load it into a target table with the column FIRST_DAY_OF_MONTH (String…
A Saraf
  • 315
  • 4
  • 20
0
votes
2 answers

Impala - Find first day on Month from String value

I have a data_date column (String data type) in the table employee, with values in YYYYMMDD format. Please suggest a solution to find the first day of the month based on the data_date column. For example: data_date - 20181217 (String value), output - 20181201 (String…
A Saraf
  • 315
  • 4
  • 20
0
votes
1 answer

What is wrong with my Group By statement?

I have a SQL Group By statement where I want to find the distinct substationcode and substationname with record count. With the correct Group By, I should be able to see records that have count for distinct substationcode + substationname…
B.Dick
  • 305
  • 2
  • 11
0
votes
0 answers

Why this, well working, query against Impala throws error via ODBC driver?

I have the query below, which runs perfectly in Hue but fails via ODBC from C#. The ODBC driver says something about unknown parameters, but I could not figure out what that is. In the logs, the ^ points to the question mark following the LIMIT…
AndrasCsanyi
  • 3,943
  • 8
  • 45
  • 77
0
votes
1 answer

How to batch inserts via Impala ODBC?

I have been querying and inserting data from and to Impala via ODBC, but it is slow (at least compared to Postgres or SQL Server), and the ODBC driver makes it possible to execute queries one by one, which is absolutely not recommended as every insert…
AndrasCsanyi
  • 3,943
  • 8
  • 45
  • 77
0
votes
1 answer

Is the syntax for a regular expression different between Hive and Impala?

The following regexp_extract function appears to work in Impala, but does not work when I use it in Hive: select regexp_extract("efwe FR wefwef", '.*?([[:upper:]]+).*?', 1) The result in Impala is FR (as I would expect, i.e. the upper case…
Tom1281
  • 11
  • 3
0
votes
0 answers

what is the best practice from Cloudera to migrate the parquet-based impala to kudu-based impala

We are using Cloudera as our Hadoop environment. Can someone please provide any guidance on how to integrate or migrate existing parquet/impala to kudu/impala to hopefully get a performance improvement to our existing pipeline? Our existing…
mdivk
  • 3,545
  • 8
  • 53
  • 91
0
votes
2 answers

How to set Python environment variable in Windows? or any other node package available for node impala connection?

While trying to set up Node on Windows, I needed to install a Node package called jdbc to connect with Impala. Running npm install jdbc gives the error: Error: Can't find Python executable "C:\Program Files\Python30\", you can set…
Thilak Raj
  • 880
  • 3
  • 10
  • 25
0
votes
1 answer

int_months_between for weeks in impala?

What would be the best solution for int_months_between in weeks for Impala? Would I have to work with Intervals, or what is the best recommendation?
Anna
  • 444
  • 1
  • 5
  • 23
0
votes
1 answer

creating and inserting data from a R dataframe to Cloudera Impala with DBI package

I have created a couple of tables (data frames) in R that I need to upload to Cloudera Impala. I am using the DBI package to connect to Impala. So I have, for example: df<-data.frame(x) How do I insert df into Impala as a table? I have seen that this…
L.Gut
  • 1
  • 1
0
votes
1 answer

is there a way to optimize this query's performance in Impala?

This query involves 4 tables and takes 10.5 hours to complete: Step1: create table temp partitioned by (date_pull) stored as parquet as select from trans_ext -- this is the base table inner join [shuffle] ac -- fact_acc inner join [shuffle]…
mdivk
  • 3,545
  • 8
  • 53
  • 91
0
votes
1 answer

Oozie Sqoop Workflow Refresh table

I update Impala tables by querying through a workflow that was created in the Oozie Editor. (But who cares? Just "I update tables".) And, at the end of the workflow, you need to run "refresh ". But I don't know how to do it. I need a non-bash method. Does Oozie can…
0
votes
2 answers

Error while compiling statement: FAILED: ParseException line 3:0 missing ALL at 'select' near '' line 5:0 missing ALL at 'select' near ''

I am running the following query in Impala select count(id) from (select s_id as id, m_id from hur_e_s_amer union select s_id, m_id from hur_e_s_emea union select r_id, m_id from hur_e_r_amer union select r_id, m_id from hur_e_r_emea ) t1 join…
Taylrl
  • 3,601
  • 6
  • 33
  • 44
0
votes
1 answer

Querying across months and days

My access logs database stores time as epoch and extracts year month and day as integers. Further, the partitioning of the database is based on the extracted Y/m/d and I have a 35 day retention. If I run this query: select * from mydb where…
mikernova
  • 55
  • 5
0
votes
1 answer

LOAD DATA INPATH table files start with some string in Impala

Just a simple question, I'm new to Impala. I want to load data from HDFS to my datalake using Impala. So I have a CSV, this_is_my_data.csv, and what I want to do is load the file without specifying all the extension, I mean something like the…
Henry Navarro
  • 943
  • 8
  • 34