SQL random sample with groups

Question

I have a university graduate database and would like to extract a random sample of data of around 1000 records.

I want to ensure the sample is representative of the population so would like to include the same proportions of courses eg

enter image description here

I could do this using the following:

select top 500 id from degree where coursecode = 1 order by newid()
union
select top 300 id from degree where coursecode = 2 order by newid()
union
select top 200 id from degree where coursecode = 3 order by newid()

but we have hundreds of courses codes so this would be time consuming and I would like to be able to reuse this code for different sample sizes and don't particularly want to go through the query and hard code the sample sizes.

Any help would be greatly appreciated

How would that make sure i get the correct proportions in the sample? — Simon, May 14 '15 at 11:01
How do you calcualte the sample size? Is it based on population percentage? — Paresh J, May 14 '15 at 11:02
The sample would be used for a questionnaire so the size of the sample depends on how much budget we have..not very scientific I know! — Simon, May 14 '15 at 11:07

score 29 · Accepted Answer · answered May 14 '15 at 11:03

29

You want a stratified sample. I would recommend doing this by sorting the data by course code and doing an nth sample. Here is one method that works best if you have a large population size:

select d.*
from (select d.*,
             row_number() over (order by coursecode, newid) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
where seqnum % (cnt / 500) = 1;

EDIT:

You can also calculate the population size for each group "on the fly":

select d.*
from (select d.*,
             row_number() over (partition by coursecode order by newid) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
where seqnum < 500 * (cc_cnt * 1.0 / cnt)

answered May 14 '15 at 11:03

Gordon Linoff

1,242,037
58
646
786

1

Brilliant, thanks. I've just done a quick check with my data and the biggest difference between the sample and the population proportions was 0.748% which is perfectly acceptable. – Simon May 14 '15 at 11:21
["Stratified sampling"](https://en.wikipedia.org/wiki/Stratified_sampling) – Clint Pachl Aug 30 '20 at 20:55
What if you wanted to guarantee exactly a certain number of rows (still proportional, stratified sampling), in this case 1,000, but in my use-case 20? – JD136 Jan 07 '22 at 23:43

shA.t · Answer 2 · 2015-05-14T11:13:20.820

1

Add a table for storing population.

I think it should be like this:

SELECT *
FROM (
    SELECT id, coursecode, ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
    FROM degree) t
    LEFT OUTER JOIN
    population p ON t.coursecode = p.coursecode
WHERE
    rn <= p.SampleSize

edited May 14 '15 at 11:13

answered May 14 '15 at 11:00

shA.t

16,580
5
54
111

score 1 · Answer 3 · answered May 14 '15 at 11:31

It is not necessary to partition the population at all.

If you are taking a sample of 1000 from a population among hundreds of course codes, it stands to reason that many of those course codes will not be selected at all in any one sampling.

If the population is uniform (say, a continuous sequence of student IDs), a uniformly-distributed sample will automatically be representative of population weighting by course code. Since newid() is a uniform random sampler, you're good to go out of the box.

The only wrinkle that you might encounter is if a student ID is a associated with multiple course codes. In this case make a unique list (temporary table or subquery) containing a sequential id, student id and course code, sample the sequential id from it, grouping by student id to remove duplicates.

score 0 · Answer 4 · answered May 14 '15 at 11:03

I've done similar queries (but not on MS SQL) using a ROW_NUMBER approach:

select ...
from 
 ( select ...
     ,row_number() over (partition by coursecode order by newid()) as rn
   from degree
 ) as d 
join sample size as s
on d.coursecode = s.coursecode
and d.rn <= s.samplesize

SQL random sample with groups

4 Answers4

Linked