I have two tables. What I want is a random sample from all the possible pairings. Say t1 has 100 rows and t2 has 200 rows, and I want a sample of 300 pairings. The naive way of doing this (run on the online DuckDB shell) is:

CREATE TABLE t1 AS FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet' LIMIT 100;
CREATE TABLE t2 AS FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet' LIMIT 200;
SELECT * FROM t1, t2 USING SAMPLE 300;

But an EXPLAIN reveals that this results in a full cross product, followed by a sample. For actually large tables this is intractable.

I tried reversing the order of operations, i.e. a shuffle followed by a limit, with

SELECT * FROM (FROM t1 ORDER BY RANDOM()) AS t1, (FROM t2 ORDER BY RANDOM()) AS t2 LIMIT 300;

and this is smarter: according to EXPLAIN it is able to stream. But this sampling of pairs is biased, because DuckDB's implementation of the cartesian product (and I'd guess most engines are the same; in fact I think it's required if you want to stream) works like a nested for loop: the first record in t1 is paired with all the records in t2 before moving on to the second record in t1, and so on. So some records in t1 are "oversampled".
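
If that nested-loop order holds, the 300-pair sample should give one t1 row all 200 of its possible pairings, a second row the remaining 100, and the other 98 rows nothing. A quick way to check the skew (a sketch using DuckDB's rowid pseudocolumn as a stand-in row identifier; t1_row and n_pairs are just my labels):

-- Count how many pairs each t1 row ends up with in the 300-row sample.
SELECT t1_row, COUNT(*) AS n_pairs
FROM (
    SELECT s1.t1_row
    FROM (SELECT rowid AS t1_row FROM t1 ORDER BY RANDOM()) AS s1,
         (SELECT rowid AS t2_row FROM t2 ORDER BY RANDOM()) AS s2
    LIMIT 300
) AS sampled
GROUP BY t1_row
ORDER BY n_pairs DESC;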

So here's my idea: sample as many records as we need from each table, in a random order, streamed, re-sampling the same table over and over as needed (we need 300 pairs, but the source tables only have 100 and 200 rows). Then pair these two streams up row by row. In pseudocode:

WITH 
t1_generator AS (
   FROM t1 ORDER BY RANDOM()
   UNION ALL
   FROM t1 ORDER BY RANDOM()
   UNION ALL
   FROM t1 ORDER BY RANDOM()
   -- etc.
),
t2_generator AS (
   FROM t2 ORDER BY RANDOM()
   UNION ALL
   FROM t2 ORDER BY RANDOM()
   UNION ALL
   FROM t2 ORDER BY RANDOM()
   -- etc.
),
s1 AS (
   SELECT *, ROW_NUMBER() OVER () AS __rn FROM t1_generator
),
s2 AS (
   SELECT *, ROW_NUMBER() OVER () AS __rn FROM t2_generator
),
zipped AS (
   SELECT s1.* EXCLUDE (__rn), s2.* EXCLUDE (__rn) FROM s1 JOIN s2 ON s1.__rn = s2.__rn
)
SELECT * FROM zipped
LIMIT 300

Except t1_generator and t2_generator are hardcoded, and you have to calculate how many UNION ALLs you need, which is ugly. It also samples WITHOUT replacement (i.e. round-robin style: it goes through all of t1 before revisiting any row), when really I want WITH replacement. I tried recursive CTEs, but those don't reshuffle the data on every "round".
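
One half-formed idea for the replacement problem (just a sketch; numbered, draws, and pick are placeholder names): draw 300 independent random positions and join them against a row-numbered copy of t1. This does sample with replacement, but the row-numbered copy has to be materialized, so I don't think it streams:

WITH numbered AS (
   -- t1 with a positional row number to join against
   SELECT *, ROW_NUMBER() OVER () AS rn FROM t1
),
draws AS (
   -- 300 independent random positions in 1..count(*), i.e. with replacement
   SELECT 1 + CAST(FLOOR(RANDOM() * (SELECT COUNT(*) FROM t1)) AS BIGINT) AS pick
   FROM range(300)
)
SELECT numbered.* EXCLUDE (rn)
FROM draws JOIN numbered ON draws.pick = numbered.rn;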

Any thoughts on how to accomplish this? I'm using duckdb if that's important. Thanks!

Nick Crews
  • that is how a cross join will always be implemented, as you need a deterministic order, so a loop is the best way to do it. – nbk Mar 01 '23 at 19:14
  • I edited my original question with a more targeted question, maybe you can take a look again @nbk? I agree cross join isn't going to work, so I'm trying something else. – Nick Crews Mar 01 '23 at 21:51
  • why not sample 5 or whatever you need from each table, add a row_number and join them. – nbk Mar 01 '23 at 21:57
  • Ah, yes that works just fine when the sample size is smaller than the number of rows in both tables, but say my tables are size 100 and 200, and I want a sample of 300 pairs, which is larger than both of the two source tables. So I need to re-sample the tables over and over again... I edited the original question so this is more clear. – Nick Crews Mar 02 '23 at 23:29
  • do you have an index on both tables? is it (approx) 1..n and 1..m? – Hogan Mar 02 '23 at 23:36
  • For that you would use a full outer join, as you don't know which table is "bigger", but that is again slow. But since you only want a sample of the data, and the algorithm will give you a good statistical sample, there is no need to get all the data. – nbk Mar 02 '23 at 23:39

1 Answer

I'm going to assume you have an index on both tables -- on table A it goes from 1 to N and on table B it goes from 1 to M.

Then you just need to generate 300 pairs of random tuples (R(1..N), R(1..M)). Store those in a table, then join table A on the first column and table B on the second column.

With large tables this would be much faster than sampling from the cross join.
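
In DuckDB that could look something like this (a sketch using the ROW_NUMBER() trick from the question in place of a real 1..N index, since the demo tables don't have one; pairs, i, j, and rn are placeholder names):

-- Generate 300 random (i, j) index pairs, with replacement on both sides.
CREATE TABLE pairs AS
SELECT
    1 + CAST(FLOOR(RANDOM() * (SELECT COUNT(*) FROM t1)) AS BIGINT) AS i,  -- R(1..N)
    1 + CAST(FLOOR(RANDOM() * (SELECT COUNT(*) FROM t2)) AS BIGINT) AS j   -- R(1..M)
FROM range(300);

-- Join each side of the pair back to its table by position.
WITH a AS (SELECT *, ROW_NUMBER() OVER () AS rn FROM t1),
     b AS (SELECT *, ROW_NUMBER() OVER () AS rn FROM t2)
SELECT a.* EXCLUDE (rn), b.* EXCLUDE (rn)
FROM pairs
JOIN a ON pairs.i = a.rn
JOIN b ON pairs.j = b.rn;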

Hogan
  • This would work, but still isn't streaming, since we have to materialize the pairs table. Still, probably more efficient since we just have to store two int columns. I think this is the best idea I've seen so far, but it would be better if it were streaming. – Nick Crews Mar 20 '23 at 20:18
  • I'm not sure what you mean by streaming – Hogan Mar 21 '23 at 22:11
  • Perhaps I don't understand either. But as I understand it, some SQL queries, such as ones with a LIMIT, don't materialize the entire result set, only the rows needed. With this implementation, we always materialize the 300 index pairs. But perhaps I'm getting ahead of myself, and this isn't actually a problem for my use case. – Nick Crews Mar 22 '23 at 00:15
  • Ok, that is exactly what I was addressing. You select the index numbers beforehand and put them in a table; when you join to these numbers, because of the indexes, only the elements you have in the table will be read. – Hogan Mar 22 '23 at 00:20