SQL Statistical sampling

Question

I'm looking for some genius SQL help with a tricky statistical problem I'm having.

What I'm looking to do is pull a statistically balanced sample out of an unbalanced group of user profiles. Doing this for a single profile attribute (e.g. gender) at a time would be somewhat straightforward. But doing it across multiple dimensions at once requires some sophistication.

For argument's sake, let's say I have this table.

Profile.userID  
Profile.Gender  
Profile.Age  
Profile.Income

If I want to pull a pool of profiles out of the mix so that the new sampling of users roughly matches all of the following characteristics:

50% male, 50% female
30% young, 40% middle age, 40% old
40% low income, 40% middle income, 20% high income

Does anyone have any ideas about how to pull this off?

What prevents you from randomly pulling records one at a time until the sample set meets your specifications? — Robert Harvey, Oct 17 '12 at 17:10
How would I keep it from continually getting out of balance? Say I need just one more female record, but pulling that one then tilts my age and income out of balance...? — tbacos, Oct 17 '12 at 17:27
30% young, 40% middle age, 40% old != 100% Is there an overlap between young and middle aged in your range? — Bryan Allo, Oct 17 '12 at 17:40
sorry - that's just my bad math in the example. it should read 30,40,30 — tbacos, Oct 17 '12 at 20:29

score 3 · Answer 1 · answered Oct 17 '12 at 18:06

What you have is a sampling problem. They key to solving this problem is to break the data into separate groups of the combinations of the three variables. Then, calculate the product of the marginal probabilities of each group (your values are marginal probabilities). Then, normalize these over all 18 groups.

For instance, the Male-Young-Low group would get a value of 0.5*0.3*0.4 = 0.06. You repeat this for all 18 groups and then normalize to a percentage (that is, divide each value by the sum of all the values). Here is the result:

Gender  Age     Income  Marg    Normalized
Male    Young   Low     0.06    5.5%
Male    Young   Middle  0.06    5.5%
Male    Young   High    0.03    2.7%
Male    Middle  Low     0.08    7.3%
Male    Middle  Middle  0.08    7.3%
Male    Middle  High    0.04    3.6%
Male    Old     Low     0.08    7.3%
Male    Old     Middle  0.08    7.3%
Male    Old     High    0.04    3.6%
Female  Young   Low     0.06    5.5%
Female  Young   Middle  0.06    5.5%
Female  Young   High    0.03    2.7%
Female  Middle  Low     0.08    7.3%
Female  Middle  Middle  0.08    7.3%
Female  Middle  High    0.04    3.6%
Female  Old     Low     0.08    7.3%
Female  Old     Middle  0.08    7.3%
Female  Old     High    0.04    3.6%

This then becomes your sample rate for each group. Here is pseudo SQL code for actually doing the sampling:

with SamplingRates (
    select 'Male' as gender, 'Young' as Age, 'Low' as income, 0.045 as SamplingRate,
    union all . . 
)
select t.*
from (select t.*,
            row_number() over (partition by gender, age, income order by <random>) as seqnum,
            count(*) over (partition by gender, age, income) as NumRecs
      from table t
     ) t join
     SampleRates sr
     on t.gender = sr.gender and t.age = sr.age and t.income = sr.income and
        seqnum <= sr.SamplingRate * NumRecs

Bryan Allo · Answer 2 · 2012-10-17T19:19:44.807

Here's how I would go about it, Assuming: 30% young, 40% middle age, 30% old

Taking least common denominators, your Pool size = 5x5x3x4x2x4 = 2400

You have 18 queries to populate your pool into a TEMP TABLE. Repeat all 18 queries to give you a larger pool. Below is an idea of what the distribution of an ideal pool would look like and what each query would look like. You can also introduce some randomness into each query. There was a previous post about doing this.

This is perhaps less elegant but should yield a balanced pool nonetheless.

Your first query in pseudocode would look like:

SELECT * INTO TEMP TABLE 
WHERE male, young, high income and ID NOT IN TEMP TABLE 
LIMIT RECORD SET 72

And so on and so forth. Hope it helps. Good question though.

CREATE TEMP TABLE
480 high income
    144 young
        72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
        72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
    192 middle age
        96 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96]
        96 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96]
    144 old
        72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
        72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]

960 middle income
    288 young
        144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
    384 middle age 
        192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
        192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
    288 old
        144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]

960 low income
    288 young
        144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
    384 middle age
        192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
        192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
    288 old
        144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]

SQL Statistical sampling

2 Answers2