4

I need a sample after I make a join of two tables, something like:

SELECT  *
from 
A left join B 
on A.col=B.col
sample 1000000

The problem is that A and B are huge (over 3 billion rows) and I run out of spool space when I try a join.

Is there a way to do the join after the sample, so that it is joining smaller tables (for example chose 10,000,000 samples from A and from B, inner join them, and select 1,000,000 from the join, hoping I do get at least 1,000,000 rows?)

P.S. I am using teradata

Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58
user
  • 2,015
  • 6
  • 22
  • 39
  • 1
    (1) What is the relation between A and B? 1:1? 1:N? (2) what are the primary indexes of A and B? – David דודו Markovitz Nov 07 '16 at 16:46
  • could you sample 1000000 A into a temporary table SELECT * INTO #A FROM A SAMPLE 1000000 then SELECT * FROM #A left join B on A.Col = B.col – Cato Nov 07 '16 at 16:47
  • You have already good answers here, but if you answer @Dudu-Markovitz question you'll allow the answers to be improved. – Insac Nov 08 '16 at 08:07
  • A and B are two tables but there is no strict 1:1 correspondence, however they have a very large intersection. I am only interested in sampling this intersection. – user Nov 08 '16 at 13:30
  • Are the columns used in the join, a primary index of one of the two tables? – Insac Nov 09 '16 at 21:29

4 Answers4

1

You can do as you proposed, apply the SAMPLE in a Derived Table:

SELECT  *
from 
 (
  SELECT * FROM A 
  SAMPLE 10000000
 ) AS A
left join B 
on A.col=B.col

Similar for an Inner Join

SELECT  *
from 
 (
  SELECT * FROM A
  SAMPLE 100000000 -- larger sample than needed 
 ) AS A
join B 
on A.col=B.col
sample 10000000
dnoeth
  • 59,503
  • 4
  • 39
  • 56
0

You could try joining subqueries using SAMPLE to limit the size:

SELECT *
FROM
(SELECT * FROM A SAMPLE 1000) t1
LEFT JOIN
(SELECT * FROM B SAMPLE 1000) t2
    ON t1.col = t2.col
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • 1
    (1) PO uses LEFT JOIN (2) statistically this query will return no rows – David דודו Markovitz Nov 07 '16 at 16:50
  • Isn't the point (2) from @DuduMarkovitz comment still valid after your edit? You take 1000 random rows from A (over 3 billions rows) and 1000 from B. The chances that you'll find some match (even if B is 1:1 with A) seem slim. – Insac Nov 08 '16 at 09:19
  • @Insac Why would the _probability_ of a matching join decrease with a decreased sample size? It should not. Hence, if most records would not match in the full join, then in my answer most records would also not match. The idea is to make the sample representative of the original data problem. – Tim Biegeleisen Nov 08 '16 at 09:23
  • I might be missing something here. What I got is that the OP wanted a sample of a large join and, since he hasn't enough spool, he would try to sample the two tables separately and then join them. In the initial query if A has a row with col=10 and B has a row with col=10, if this row is in the sample, it will appear in the result. In your query, to appear in the resultset , col=10 must be sampled independently both from A and from B, and this is quite unlikely ( 1/3000000^2, I'd assume) – Insac Nov 08 '16 at 09:30
  • I've tried to implement a small test (http://pastebin.com/9Xkcmx6g). If we were analyzing a self join, your approach would consistently work. However, since A and B are different tables, they seem to be sampled independently and this apparently introduce a matching issue that is not part of the original data problem. – Insac Nov 08 '16 at 10:02
  • In any case, sampling both tables will return wrong result for a left join. Given a record r from A and a record r' from B with the same col value, if r was sampled, r' must be also retrieved in order to create the correct result set. – David דודו Markovitz Nov 08 '16 at 11:07
0
SELECT * INTO #A FROM A SAMPLE 1000000;

followed by

SELECT * FROM #A left join B on #A.Col = B.col;

I'm bearing in mind that in your original query, you seem to be asking for any 1000000 of A, that will then LEFT JOIN to B where there is a batch, or return NULL for B if no match - I'm assuming it is a 1-1 or 1-0 join also - otherwise it doesn't match your original idea

Cato
  • 3,652
  • 9
  • 12
0
SELECT  *
from 
(select * from A sample 1000000) A left join B 
on A.col=B.col
Pang
  • 9,564
  • 146
  • 81
  • 122
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88