I'm trying to check the distribution of values in a numeric column of a table. Rather than computing it over the entire table (which is large: tens of gigabytes), I want to estimate it via repeated sampling. I think the typical Postgres method for this is
select COLUMN
from TABLE
order by random()
limit 500;
but this is slow for repeated sampling, especially since (I suspect) it has to generate a random value for every row and sort the entire table each time I run it.
Is there a better way?
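For what it's worth, I know newer Postgres (9.5+) has TABLESAMPLE, which samples pages rather than sorting. A rough sketch (COLUMN and TABLE are placeholders, as above):

-- system sampling picks whole pages at random, scanning roughly the
-- given percentage of the table; bernoulli samples individual rows
-- but still visits every page. Requires Postgres 9.5+.
select COLUMN
from TABLE tablesample system (1);

But system sampling returns rows clustered by page, and the sample size is a percentage rather than a fixed row count, so I'm not sure it's statistically safe for checking a distribution.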
EDIT: Just to make sure I've expressed this clearly, what I want to do is:
for (i in 1:numSamples) {
  # draw 500 random rows
}
without having to reorder the entire massive table each time. Perhaps I could pull all of the table's row IDs into R, sample from them there, and then just request the matching rows?
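Something like this rough sketch is what I have in mind (assuming a table my_table with an integer primary key id and a numeric column x; those names, the connection details, and the DBI/RPostgres usage are just illustrative):

library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

# One-time cost: pull every row ID into R. Even for a huge table this
# is only one integer per row, but it's the step I'm least sure about.
ids <- dbGetQuery(con, "SELECT id FROM my_table")$id

numSamples <- 100
samples <- vector("list", numSamples)

for (i in 1:numSamples) {
  # Sample 500 IDs client-side, then fetch only those rows by primary
  # key, so Postgres never has to sort or scan the whole table.
  picked <- sample(ids, 500)
  qry <- sprintf("SELECT x FROM my_table WHERE id IN (%s)",
                 paste(picked, collapse = ", "))
  samples[[i]] <- dbGetQuery(con, qry)$x
}

dbDisconnect(con)

Each iteration would then be 500 index lookups instead of a full sort, at the price of holding all the IDs in R's memory. Is that a sane approach, or is there something more idiomatic on the Postgres side?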