(Note: Not a duplicate of this question, since that question had a specific query in mind. This is more from a general, theory perspective.)
I am learning about databases in college, and I have learned that SQL databases, in order to execute a query, first translate it into relational algebra in order to form a plan of how to execute it. Relational algebra makes it quite easy to represent non-correlated SQL subqueries, which we can then combine into joins or set operations if needed. For instance, we can quite easily express
SELECT y FROM Table WHERE y NOT IN (SELECT x FROM AnotherTable);
as a set difference between two queries.
However, as far as I have learned, relational algebra provides no mechanism to express SQL's notion of correlated subqueries, or queries that capture columns from a parent query and thus must be executed multiple times, such as this example below:
SELECT employee_number, name
FROM employees AS emp
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = emp.department);
(In this example, the subquery in the outer WHERE
clause is correlated to the employee's department, meaning that the subquery must be run once for each employee in order to filter results by department.)
Like many other correlated subqueries, it is possible to decorrelate this subquery by instead using a join and a single aggregation query, which will allow the query to be perfectly expressible in RA and also cause the subquery to run only once:
SELECT emp.employee_number, emp.name
FROM employees AS emp
JOIN (
SELECT department AS department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department) AS salaries
ON emp.department = salaries.department
WHERE emp.salary > salaries.avg_salary;
Is it possible, however, to express all correlated subqueries as decorrelated subqueries, in order to express them as relational algebra, or are there some correlated subqueries that must be expressed as such? In other words, is SQL correlation simply a convenience feature that does not add anything to SQL's expressive power, or is RA merely an implementation guideline for which SQL is more expressive because of this feature?
If the former, what is the proof/algorithm of this? If the latter, is there a generally accepted form of RA that expresses correlation directly?