Is it possible to decorrelate all correlated SQL subqueries?

Question

(Note: Not a duplicate of this question, since that question had a specific query in mind. This is more from a general, theory perspective.)

I am learning about databases in college, and I have learned that SQL databases, in order to execute a query, first translate it into relational algebra in order to form a plan of how to execute it. Relational algebra makes it quite easy to represent non-correlated SQL subqueries, which we can then combine into joins or set operations if needed. For instance, we can quite easily express

SELECT y FROM Table WHERE y NOT IN (SELECT x FROM AnotherTable);

as a set difference between two queries.

However, as far as I have learned, relational algebra provides no mechanism to express SQL's notion of correlated subqueries, or queries that capture columns from a parent query and thus must be executed multiple times, such as this example below:

SELECT employee_number, name
  FROM employees AS emp
  WHERE salary > (
    SELECT AVG(salary)
      FROM employees
      WHERE department = emp.department);

(In this example, the subquery in the outer WHERE clause is correlated to the employee's department, meaning that the subquery must be run once for each employee in order to filter results by department.)

Like many other correlated subqueries, it is possible to decorrelate this subquery by instead using a join and a single aggregation query, which will allow the query to be perfectly expressible in RA and also cause the subquery to run only once:

SELECT emp.employee_number, emp.name
  FROM employees AS emp
  JOIN (
    SELECT department AS department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department) AS salaries
    ON emp.department = salaries.department
  WHERE emp.salary > salaries.avg_salary;

Is it possible, however, to express all correlated subqueries as decorrelated subqueries, in order to express them as relational algebra, or are there some correlated subqueries that must be expressed as such? In other words, is SQL correlation simply a convenience feature that does not add anything to SQL's expressive power, or is RA merely an implementation guideline for which SQL is more expressive because of this feature?

If the former, what is the proof/algorithm of this? If the latter, is there a generally accepted form of RA that expresses correlation directly?

score 2 · Accepted Answer · edited Mar 30 '19 at 19:36

First of all, it is possible to express correlated sub queries (-> dependent joins) in relational algebra and also aggregates.

You might be interested in this: http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf

I haven't read the entire paper, but I attended a class with Professor Neumann. He claimed that you can decorrelate arbitrary queries. However there are some limitations I believe.

select *
from T1
where T1.a = (select T2.a from T2 where T2.b = T1.b)

In principle it would be easy to decorrelate this query and I after looking at the query plan I believe the database system at their chair is able to do that (https://hyper-db.de/interface.html). I think you could not express that in SQL though, since here you get a runtime error if the subquery does not return a scalar (https://blogs.msdn.microsoft.com/craigfr/2006/09/27/scalar-subqueries/).

It is easy to map equality of a scalar subquery to, say, IN then a JOIN. Also: In his first paper on a non-algebraic query language in the style of predicate logic Codd showed how to map such expressions to a relational algebra. Since a (suitable strong) relational algebra can express any (suitably strong) function of a database state, SQL can be mapped to it. — philipxy, Feb 11 '19 at 23:51

Erwin Smout · Answer 2 · 2018-05-20T15:26:08.680

1

I want to remark here that you may be suffering from a misunderstanding of when exactly a sub-query is "correlated".

In my understanding, a sub-query is "correlated" if AND ONLY IF it references columns from a containing (/outer) query. Your SELECT AVG(...) query does not fit that description. It is what is called a scalar subquery. Because it does nothing but compute a scalar value from some table. And it can also "stand on its own".

As for the actual question, any difficulty in pinning down an RA notation for THIS PARTICULAR query, is due to the fact that the query involves an aggregation, and that RA notations typically do not support denoting such an operation. If they did, a JOIN of the original table with your SELECT AVG(...) plus the needed RESTRICT on that JOIN, would solve your issue.

edited May 20 '18 at 15:26

answered May 20 '18 at 15:21

Erwin Smout

18,113
4
33
52

Thank you for your clarification. I will note here that I do decorrelate the SELECT AVG() query successfully in the question using a JOIN. What I am asking about is the general case of being able to express *any* correlated query as a decorrelated RA query, or if RA directly supports the notion of correlation. – TheHans255 May 20 '18 at 16:14
@TheHansinator Relational algebras don't have a notion of "correlation" because they don't have bound/quantified/logic variables. However, suitably rearranged expressions correspond to algebra expressions. See Codd's ALPHA paper for an example of this. This is painful to demonstrate in SQL only because its syntax & semantics is atrocious. – philipxy Feb 11 '19 at 23:54

Is it possible to decorrelate all correlated SQL subqueries?

2 Answers2