2

I came across this journal: http://users.dcc.uchile.cl/~cgutierr/cursos/BD/divisionSQL.pdf that gave a method of translating an relational algebra division into sql. I am a little skeptical if this actually works,

Given T1 (A,B) and T2(B)

 SELECT A FROM T1 
 WHERE B IN ( SELECT B FROM T2 ) 
 GROUP BY A 
 HAVING COUNT(*) = ( SELECT COUNT (*) FROM T2 );

because suppose you were given a row in T1[A,B] such that there was duplicates that satisfied the statement

where B in (select B from T2), wouldn't that cause the having clause to skip that row? or does the "in" statement remove duplicates?

user
  • 854
  • 2
  • 12
  • 28
  • They simply assume that data is unique :-) If `B` is not unique simply change `COUNT(*)` to `COUNT(DISTINCT B)` – dnoeth Oct 27 '15 at 17:24
  • "given a row in T1[A,B] such that there was duplicates that satisfied the statement" is not clear. Multiple rows in what table that match what row from what table? Please use more words and say what you mean clearly. – philipxy Jan 06 '17 at 22:35

2 Answers2

1

No. Relational algebra division takes a set of relations (A,B), and returns the set of A such that there is a relation between A and all members of a given subset of Bs. For example, in the paper you linked, a1 and a3 are the only As that have a relation to b2 and b3. The line WHERE B IN ( SELECT B FROM T2 ) filters the rows of T1 so that only those rows with b2 or b3 are in the B column. This is equivalent to an INNER JOIN between T1 and T2 on their respective Bs.

Note that there are no duplicate entries in T1 or T2. A few equivalent queries (plus the equivalent in the journal, though note their caveat that these queries return an empty set if T2 is empty, unlike the paper's Q1):

 SELECT A FROM T1 
 WHERE EXISTS(SELECT 1 FROM T2 WHERE T2.B = T1.B) 
 GROUP BY A 
 HAVING COUNT(*) = ( SELECT COUNT (*) FROM T2 );

 SELECT A FROM T1 
 INNER JOIN T2 ON T1.B = T2.B
 GROUP BY A 
 HAVING COUNT(*) = ( SELECT COUNT (*) FROM T2 );
Adam Martin
  • 1,188
  • 1
  • 11
  • 24
0

Your question is unclear. But you seem to be asking the following of that query Q0: When a row of T1 matches multiple/duplicate rows of T2 will using COUNT equality filter it out of the result and so not give the right answer?

If there are no duplicates of the row in T1 [sic] then yes it will be filtered out. (Otherwise it might or might not be.) But there are no duplicate rows in relational algebra, so this is all moot.

For a version handling duplicates in SQL you have to decide what you want "divide by" to mean when there are duplicate rows. You can reuse an algebra definition while taking the names of other algebra operators to be similarly named SQL/bag operators, but you get different results depending on what definition you start with.

Anyway Q0 does not return a division. Division can return rows of T1 whose B does not appear in T2; Q0 cannot. Division returns rows AB of T1 where A appears in T1 with every row of T2 and some row of T2. Q0 returns rows AB of T1 where A appears in T1 with every row of T2 and only rows of T2.

In fact relational division, Q0 and Q1 ask for three different sets of rows. Q1 returns rows AB of T1 where A appears in T1 with every row of T2.

The paper is poorly written:

  • Q0 does not calculate relational division.

  • The paper calls query Q1 the "classical version" of division but Q1 is not division. (The paper acknowledges this via "the code for Q0 and Q1 produce two different results" yet contradicts itself by calling Q1 an "implementation of the division operator" and Q0 "an alternative implementation".)

PS: SELECT removes duplicates if and only if DISTINCT is specified. IN tells you whether a row is in a table. Asking whether a row is IN the DISTINCT version is always the same as asking whether it is in the non-DISTINCT version. Assuming that we want rows WHERE subrows are IN a subselect, there's no need to use DISTINCT.

philipxy
  • 14,867
  • 6
  • 39
  • 83