0

I am new to Impala having come from an Oracle SQL background. I am tasked with improving an existing SQL script from a performance point of view. The existing script includes the following in the where clause

colA > 0 and colB - colA > 10

I am fairly sure that the first part, colA > 0, is just there in an effort to prevent the second part being evaluated. I am of this opinion because, in the context that this script is being run, the evaluation of colB - colA > 10 can go ahead irrespective of whether colA > 0 or not.
Can someone confirm to me that neither AND nor OR are short-circuit evaluated in Impala? I would be really surprised if they are but I cannot find a definitive answer anywhere.

Salman A
  • 262,204
  • 82
  • 430
  • 521
AnneC
  • 129
  • 1
  • 10
  • Why would anyone add `colA > 0` to prevent the second part being evaluated? I'd never allow such things in my projects. – jarlh May 09 '23 at 10:53
  • Ours is not to reason why; ours is but to do and... well, make it run faster. – AnneC May 09 '23 at 11:28

2 Answers2

2

Its hard to find this information in the internet, free version of chatGPT is not sure so i think that we can take a look at source code and try to find something here

Lets start from sql-scanner (Lexer in jflex): sql-scanner.flex

public static void init(TReservedWordsVersion reservedWordsVersion) {
    // initilize keywords
    keywordMap = new LinkedHashMap<>();
    keywordMap.put("&&", SqlParserSymbols.KW_AND);
    keywordMap.put("add", SqlParserSymbols.KW_ADD);
    keywordMap.put("aggregate", SqlParserSymbols.KW_AGGREGATE);
    keywordMap.put("all", SqlParserSymbols.KW_ALL);
    keywordMap.put("alter", SqlParserSymbols.KW_ALTER);
    keywordMap.put("analytic", SqlParserSymbols.KW_ANALYTIC);
    keywordMap.put("and", SqlParserSymbols.KW_AND);

Here you can see that in Impala you can't use "&", the second thing is that "and" and "&&" are resolved to the same key word - KW_AND

It suggests that only short-circuit "and" is used in Impala but lets dig deeper

In sql parser i found this: sql-parser

compound_predicate ::=
  expr:e1 KW_AND expr:e2
  {: RESULT = new CompoundPredicate(CompoundPredicate.Operator.AND, e1, e2); :}
  | expr:e1 KW_OR expr:e2
  {: RESULT = new CompoundPredicate(CompoundPredicate.Operator.OR, e1, e2); :}
  | KW_NOT expr:e
  {: RESULT = new CompoundPredicate(CompoundPredicate.Operator.NOT, e, null); :}
  | NOT expr:e
  {: RESULT = new CompoundPredicate(CompoundPredicate.Operator.NOT, e, null); :}
  ;

Which shows that KW_AND with its operands is translated to CompoundPredicate(CompoundPredicate.Operator.AND, e1, e2)

In this file i found this: AndPredicate

// (<> && false) is false, (true && NULL) is NULL
BooleanVal AndPredicate::GetBooleanValInterpreted(
    ScalarExprEvaluator* eval, const TupleRow* row) const {
  DCHECK_EQ(children_.size(), 2);
  BooleanVal val1 = children_[0]->GetBooleanVal(eval, row);
  if (!val1.is_null && !val1.val) return BooleanVal(false); // short-circuit

  BooleanVal val2 = children_[1]->GetBooleanVal(eval, row);
  if (!val2.is_null && !val2.val) return BooleanVal(false);

  if (val1.is_null || val2.is_null) return BooleanVal::null();
  return BooleanVal(true);
}

Which probably confirms that it is short-circuit but i think that you should treat my finding with causion.

You can try to dig dipper if you want to see more, i think that my answer is good starting point :)

M_S
  • 2,863
  • 2
  • 2
  • 17
2

The SQL standard does not define if AND should short-circuit or not, it only defines the truth table.

This makes sense. Consider this example:

where person.gender = 'male' and person.age > 50

For short-circuiting the RDBMS must evaluate the AND condition from left to right... the order in which the programmer typed the code.

This would mean RDBMS cannot use its knowledge about the data to find a better order of evaluation (e.g. when there is an index on age column or when age is more selective than gender). If the where clause involves multiple tables then the decision becomes more difficult for the programmer:

where order.date = '2023-05-12' and customer.city = 'seattle'

Without short-circuiting, the RDBMS is in charge of finding the most optimal method to execute a query.

Salman A
  • 262,204
  • 82
  • 430
  • 521