0

I am trying to find a way to split records for a test. I have a couple of columns that I need to consider like mailsegment and pack. I am wanting to get an even (or as close to) split at both levels and rolled up.

I have seen this solution (How do I select TOP 5 PERCENT from each group?) and applied it to my problem with reasonable success at the detailed level, however when I roll my segments up to a pack level it isn't even. This is due to mailsegments having an odd number of records in some cases. Is there a way that I can balance this out at a pack level (eg. only 2 packs) and still stratify my sample accurately across my mailsegment (200+ segments)?

Thanks in advance for the assistance.

Community
  • 1
  • 1
Dan
  • 2,625
  • 5
  • 27
  • 42

1 Answers1

1

What you want is a stratified sample. You can do this by using row_number() and taking the nth value:

select t.*
from (select t.*,
             row_number() over (partition by mailsegment, pack order by newid()) as seqnum
      from table t
     ) t
where seqnum % 20 = 1;

EDIT:

The above should work, but it is actually better to sort the data without partitioning.

For an update:

with toupdate as (
      select t.*,
             row_number() over (order by mailsegment, pack, newid()) as seqnum
      from table t
     )
update toupdate
    set myField = 'Test 1'
    where seqnum % 20 = 1;

This will work for additional test cells.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • So if I need to update `table t` I could use `update t set myfield = 'Test 1' from (...`. Would that be correct? Also not to be difficult, could this be further applied on an even 3 or 4 way split? – Dan Apr 21 '15 at 01:58