SQL Sampling based on the whole population

Question

I have a population of records...let's say 10,000 athletes, grouped by sports, where (numbers below would be variable):

4,000 are from NBA
2,000 are from NHL
3,000 are from MLB 
1,000 are from NFL

How can I build a sample query that will sample 100 records based on the population, not fully random but pull out:

NBA/Whole Population=X Select Top X * From MainTable Where league= 'NBA' (something like this)

40 names are from NBA 20 names are from NHL 30 names are from MLB 10 names are from NFL.

This is just a sample of the population, logic here is to calculate what the ratios are with regard to the whole population and then apply them to the sample size.

Regards

Possible duplicate of [SQL random sample with groups](http://stackoverflow.com/questions/30235542/sql-random-sample-with-groups) — Peter B, Jan 20 '17 at 15:54

score 0 · Answer 1 · answered Jan 20 '17 at 20:34

Dim Leagues(1 To 4) As String

Leagues(1) = "NHL"
Leagues(2) = "MLB"
Leagues(3) = "NFL"
Leagues(4) = "MLS"


Set db = CurrentDb

For x = 1 To 4

y = 0
sqql = "Select * from Maintable Where League = '" & leagues(x) & "'"

Set cf = db.OpenRecordset(sqql)

Set samp = db.OpenRecordset("RANDOMSAMPLE")






Do While y < (x * 1000) ' adjust as necessary just swagged in you wanted 1000 from league 1, 2000 league 2 etc

cf.MoveLast
cf.MoveFirst

i = Int((cf.RecordCount - 1 + 1) * Rnd + 1)

cf.Move (i)

With samp

.AddNew

.fields("Yourfield here") = cf![your field ]

' repeat as nec

.Update

End With
y = y + 1
Loop


cf.Close
Next x

samp.Close

score 0 · Accepted Answer · answered Jan 20 '17 at 23:26

Consider using a count correlated subquery for a rank order that you then use as filtering criteria for sample ratio.

SELECT main.*
FROM 
   (SELECT *,
         (SELECT Count(*) FROM MainTable sub 
          WHERE sub.League = t.League AND sub.UniqueID <= t.UniqueID) As Rank
    FROM MainTable t) AS main

WHERE main.Rank <= CInt((SELECT Count(*) FROM MainTable sub
                         WHERE sub.League = main.League) / 
                         (SELECT Count(*) FROM MainTable) * 100)
ORDER BY main.League, main.Rank

To explain above query with nested subqueries and derived tables:

The derived table, main, is exact source MainTable with a new column called Rank that gives an ordinal count of records for each League. So for the first NBA record (not necessarily first row), it is tagged rank 1, next NBA record (which can appear anywhere like 89th row) is tagged 2, and so on for each League. And yes, Rank will go up to 4,000 if needed!
Once this Rank field is calculated giving ordinal 1, 2, 3, ... indicators for each League grouping, we then position this SELECT statement as a derived table in FROM clause in order to use Rank in WHERE filter for the sample ratio. We cannot calculate a column and filter in same SELECT call.
Sample ratio is the last two subqueries used for a quotient that calculates: (# of League records matching current row / total # of table records). This value is then multiplied by 100 per sample quota. CInt is used to return integer values of possible decimal ratios. Consider also Round(..., 0) which rounds instead of strips decimal points.

Parfait...is there a way to randomize the rank? It seems like it ranks the records the same way....what if I wanted to get different results every time I run this query? — Lev Dashev, Jan 31 '17 at 17:35
Good question. Consider asking as a new one as the answer may involve application layer code (VBA, PHP, Python, Java, etc.) above database layer. SQL in the end is only a special-purpose, declarative language where columns/variables once declared are immutable. — Parfait, Feb 02 '17 at 01:06

SQL Sampling based on the whole population

2 Answers2