
I'm trying to check the distribution of numbers in a column of a table. Rather than calculating over the entire table (which is large, tens of gigabytes), I want to estimate it via repeated sampling. I think the typical Postgres method for this is

select COLUMN
from TABLE
order by RANDOM()
limit 1;

but this is slow for repeated sampling, especially since (I suspect) it manipulates the entire column each time I run it.

Is there a better way?

EDIT: Just to make sure I expressed it right, I want to do the following:

for(i in 1:numSamples)
  draw 500 random rows
end

without having to reorder the entire massive table each time. Perhaps I could get all of the table's row IDs, sample from them in R or something, and then just request those rows?

Patrick McCarthy
  • In SQL:2003 the TABLESAMPLE clause was introduced (DB2 has implemented it; not sure if there are others). It has been discussed (https://wiki.postgresql.org/index.php?title=TABLESAMPLE_Implementation&action=history) for Postgres, but I don't know the current status. – Lennart - Slava Ukraini Jul 05 '14 at 04:52
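
For reference, where that clause is available (PostgreSQL gained TABLESAMPLE in 9.5, after this exchange), the syntax looks roughly like this; the table, column, and percentage below are placeholders:

-- draw roughly 0.01% of rows without sorting the whole table;
-- REPEATABLE fixes the seed so the sample can be reproduced
select some_column
from some_table tablesample bernoulli (0.01)
repeatable (42);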

2 Answers


As you want a sample of the data, what about using the estimated size of the table and then calculating a percentage of that as the sample size?

The table pg_class stores an estimate of the number of rows for each table (updated by the vacuum process if I'm not mistaken).

So the following would select 1% of all rows from that table:

with estimated_rows as (
  select reltuples as num_rows
  from pg_class t
    join pg_namespace n on n.oid = t.relnamespace
  where t.relname = 'some_table'
  and n.nspname = 'public'
)
select *
from some_table
limit (select 0.01 * num_rows from estimated_rows)
; 

If you do that very often you might want to create a function so you could do something like this:

select *
from some_table
limit (select estimate_percent(0.01, 'public', 'some_table'))
; 
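
One way to build that helper, reusing the pg_class lookup above; the function name and argument order are taken from the usage example, and the body is only a sketch:

-- sketch of a helper returning an estimated row count for a given
-- percentage of a table, based on the planner statistics in pg_class
create or replace function estimate_percent(pct numeric, schema_name text, table_name text)
  returns bigint
  language sql
  stable
as $$
  select (pct * t.reltuples::numeric)::bigint
  from pg_class t
    join pg_namespace n on n.oid = t.relnamespace
  where t.relname = table_name
    and n.nspname = schema_name;
$$;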
  • But wouldn't it be the case that, in a 100 row table, a 50% sample would give me rows 1:50, and a 20% sample would give me rows 1:20? My objective here is to be able to do multiple independent draws. – Patrick McCarthy Jul 05 '14 at 15:31
  • @PatrickMcCarthy as rows in a table are not stored in any particular order (especially if they are updated frequently) it should give you a "random" sample. But you are right: if you run it twice you will most probably get the same rows. You _can_ still use `order by random()`, but it will make the query quite slow as it always reads the whole table. –  Jul 05 '14 at 16:01

Create a temporary table from the target table, adding a row number column:

drop table if exists temp_t;
create temporary table temp_t as
select *, (row_number() over())::int as rn
from t;

Create a lighter temporary table by selecting only the columns that will be used in the sampling, and filter as necessary.
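
For example, if only one column is needed (the column name and filter here are placeholders):

drop table if exists temp_t;
create temporary table temp_t as
select the_column, (row_number() over())::int as rn
from t
where the_column is not null;  -- keep only rows you actually want to sample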

Index it by the row number column:

create index temp_t_rn on temp_t(rn);
analyze temp_t;

Issue this query for each sample:

with r as (
    select ceiling(random() * (select max(rn) from temp_t))::int as rn
    from generate_series(1, 500) s
)
select *
from temp_t
where rn in (select rn from r);

SQL Fiddle

Clodoaldo Neto