2

I want to randomly sample some subset of my data, using a random seed so that it will be repeatable. Currently, I have this working minus the seed:

select * from my_table
where version is not null
  and start_datetime::date >= date('2020-03-16')
  and start_datetime::date < date('2020-05-15')
order by random()
limit 10000

Now I'd like to set a random seed so I can reliably get the same results from this query.

Is there a good way to do this?

Peter
  • 12,274
  • 9
  • 71
  • 86

2 Answers2

2

One option uses random function setseed(). As explained in the documentation: once this function is called, the results of subsequent random() calls in the current session are repeatable by re-issuing setseed() with the same argument.

The technique is to include the call to the function directly in your query, using union all, and then sort in the outer query. This requires listing the columns that you want to return from the query. Assuming that you want columns col1, col2, col3, your query would be:

select * 
from (
    select setseed(0.5) colx, null col1, null col2, null col3
    union all
    select null, col1, col2, col3
    from mytable
    where 
        rpt.assist_detail.version is not null
        and start_datetime::date >= date '2020-03-16'
        and start_datetime::date <  date '2020-05-15'
    offset 1
) t
order by random()
limit 10000

offset 1 is there to drop the row generated by the first subquery. The argument of setseed() (here 0.5) may be any arbitrary value between -1 and 1. As long as you pass the same value, you get the same sort.

GMB
  • 216,147
  • 25
  • 84
  • 135
-1

RANDOM() is non-repeatable by definition.

If you want to get the same ordering based on a random value again, you have no choice but to store the random values, once obtained, somewhere.

I suggest you create your own table, adding an integer column, which you fill with a random integer:

CREATE TABLE my_random_ordered_sample AS
SELECT
  (RANDOM()*10000)::INT AS rand_ord
, *
FROM mytable
WHERE rpt.assist_detail.version IS NOT NULL
  AND start_datetime::DATE >= '2020-03-16'::DATE
  AND start_datetime::DATE <  '2020-05-15'::DATE
;

Once you have it, you can:

SELECT
  all_the
, columns_
, except_the
, ordering_column
, named_rand_ord
FROM my random_ordered_sample
ORDER BY rand_ord;
marcothesane
  • 6,192
  • 1
  • 11
  • 21