Even splitting of a group to within 1%

Question

I have been tasked with taking a group of customers and splitting them into two equal groups for each store location. The result set requested would have the two groups for each store location within 1% of each other on customer count, within 1% of each other on order count, and within 1% of each other on amount ordered.

Below is the code I came up with and it works fairly well and most of the time it gets the desired result but sometimes(I think due to an outlier in the group) the % will be further off than 1%.

If OBJECT_ID('tempdb.dbo.#Orders')  IS NOT NULL     DROP TABLE #Orders

Select
     StoreID
    ,CustomerID
    ,Sum(OrderID) as Orders
    ,Sum(OrderAmount) as AmountSold

Into #Orders

From CustomerOrders

Group by StoreID,CustomerID

IF OBJECT_ID('tempdb.dbo.#OrderRanking')    IS NOT NULL     DROP TABLE #OrderRanking

Select
     O.*
    ,ROW_NUMBER() Over(Partition by StoreID Order by AmountSold, Orders) as Ranking

Into #OrderRanking

From #Orders as O


Select
     R.StoreID
    ,Count(CustomerID) as CustomerCount
    ,Sum(R.Orders) as Orders
    ,Sum(R.AmountSold) as Amountsold
    ,Case When Ranking%2 = 0 Then 'A' Else 'B' End as 'Grouping'

From #OrderRanking as R

Group by
     R.StoreID
    ,Case When Ranking%2 = 0 Then 'A' Else 'B' End

Is there a better way to split the groups to ensure the 1% variance? Or maybe a way to loop through several different splits until it finds a 1%? If looping would need a fail safe to prevent infinite loop in case of impossible split something like after x loops just take closest split.

I am using SQL Server 2012 and SSMS 2016. Thanks for any help you can provide.

Edit: I had tried to convert the code into something not company specific I messed up the code. I realized that and adjusted the code to show what is truly being sought after.

Edit2: I made some progress on my own and wanted to update the question.

So I was working on this some more and I was able to get it to sort on a random order each time you run the code and have it display the Variance for each of the groups. Now all I want to add is a way to loop through X number times and keep the one that has lowest overall variance. This weekend I might try a few more things. But for now below is the new code I spoke of.

If OBJECT_ID('tempdb.dbo.#Orders')  IS NOT NULL     DROP TABLE #Orders

Select
     StoreID
    ,CustomerID
    ,Sum(OrderID) as Orders
    ,Sum(OrderAmount) as AmountSold
    ,Rand() as Random

Into #Orders

From CustomerOrders

Group by StoreID,CustomerID

IF OBJECT_ID('tempdb.dbo.#OrderRanking')    IS NOT NULL     DROP TABLE #OrderRanking

Select
     O.*
    ,ROW_NUMBER() Over(Partition by StoreID Order by Random) as Ranking

Into #OrderRanking

From #Orders as O


If OBJECT_ID('tempdb.dbo.#Split')   IS NOT NULL     DROP TABLE #Split

Select
     R.StoreID
    ,Count(CustomerID) as CustomerCount
    ,Sum(R.Orders) as Orders
    ,Sum(R.AmountSold) as Amountsold
    ,Case When Ranking%2 = 0 Then 'A' Else 'B' End as 'Grouping'

Into #Split

From #OrderRanking as R

Group by
     R.StoreID
    ,Case When Ranking%2 = 0 Then 'A' Else 'B' End


Select
     S.StoreID
     ,((Cast(Max(Case When S.[Grouping] = 'A' Then S.CustomerCount Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.CustomerCount Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.CustomerCount Else 0 End) as decimal(18,2)))*100 as CustomerCountVar
     ,((Cast(Max(Case When S.[Grouping] = 'A' Then S.Orders Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.Orders Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.Orders Else 0 End) as decimal(18,2)))*100 as OrderVar
    ,((Cast(Max(Case When S.[Grouping] = 'A' Then S.Amountsold Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.Amountsold Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.Amountsold Else 0 End) as decimal(18,2)))*100 as AmountsoldVar


From #Split as S

Group by S.StoreID

sounds like an odd requirement, is this just a technical exercise or is it a real world request? do you have a fixed data set to work with or is it changing over time? as you say, outliers in any dataset could skew the results away from 1%, i'm not sure you can guarantee that. — Tanner, Feb 02 '18 at 16:35
I agree that due to outliers the 1% is likely to not be guaranteed which is what I thought about the looping X times and picking closest to 1% idea. Just not sure how to code that. — Chad Portman, Feb 02 '18 at 16:59
I don't know if I fully understood your needs... But it might help to use `RANK` (probably better than `DENSE_RANK()` or even to create a separate ranking für `AmountSold` and `Orders` and take the average. — Shnugo, Feb 02 '18 at 17:07

score 1 · Accepted Answer · answered Feb 03 '18 at 03:27

So it is truly impossible to always get within 1% as we all expected but like I said we were ok with trying to get as close as possible after X number of tries. I have figured out how to make this happen. Below is the code I used currently set at 10 tries but can be changed to whatever number works for the business.

If OBJECT_ID('tempdb.dbo.#TestB')       IS NOT NULL     DROP TABLE #TestB
Create Table #TestB
(
 StoreID int
,CustomerID VarChar(11)
,Orders int
,AmountSold Float
,Random Float
,Ranking bigint
,CombinedVar Decimal(18,2)
)

If OBJECT_ID('tempdb.dbo.#BestPrep')        IS NOT NULL     DROP TABLE #BestPrep
Create Table #BestPrep
(
 StoreID int
,CustomerID VarChar(11)
,Orders int
,AmountSold Float
,Random Float
,Ranking bigint
,CombinedVar Decimal(18,2)
)


Declare @Giveup int

Set     @GiveUp = 10






WHILE @GiveUp > 0
BEGIN 


If OBJECT_ID('tempdb.dbo.#Orders')  IS NOT NULL     DROP TABLE #Orders

Select
     StoreID
    ,CustomerID
    ,Sum(OrderID) as Orders
    ,Sum(OrderAmount) as AmountSold
    ,Rand() as Random

Into #Orders

From CustomerOrders

Group by StoreID,CustomerID

IF OBJECT_ID('tempdb.dbo.#OrderRanking')    IS NOT NULL     DROP TABLE #OrderRanking

Select
     O.*
    ,ROW_NUMBER() Over(Partition by StoreID Order by Random) as Ranking

Into #OrderRanking

From #Orders as O


If OBJECT_ID('tempdb.dbo.#Split')   IS NOT NULL     DROP TABLE #Split

Select
     R.StoreID
    ,Count(CustomerID) as CustomerCount
    ,Sum(R.Orders) as Orders
    ,Sum(R.AmountSold) as Amountsold
    ,Case When Ranking%2 = 0 Then 'A' Else 'B' End as 'Grouping'

Into #Split

From #OrderRanking as R

Group by
     R.StoreID
    ,Case When Ranking%2 = 0 Then 'A' Else 'B' End

If OBJECT_ID('Tempdb.dbo.#Var')     IS NOT NULL     DROP TABLE #Var

Select
     S.StoreID

    ,ABS(((Cast(Max(Case When S.[Grouping] = 'A' Then S.CustomerCount Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.CustomerCount Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.CustomerCount Else 0 End) as decimal(18,2)))*100) as CustomerCountVar

    ,ABS(((Cast(Max(Case When S.[Grouping] = 'A' Then S.Orders Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.Orders Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.Orders Else 0 End) as decimal(18,2)))*100) as OrderVar

    ,ABS(((Cast(Max(Case When S.[Grouping] = 'A' Then S.Amountsold Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.Amountsold Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.Amountsold Else 0 End) as decimal(18,2)))*100) as AmountsoldVar

    ,ABS(((Cast(Max(Case When S.[Grouping] = 'A' Then S.Orders Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.Orders Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.Orders Else 0 End) as decimal(18,2)))*100) 
    +
     ABS(((Cast(Max(Case When S.[Grouping] = 'A' Then S.Amountsold Else 0 End) as decimal(18,2))-Cast(Max(Case When S.[Grouping] = 'B' Then S.Amountsold Else 0 End) as decimal(18,2)))
        / Cast(Max(Case When S.[Grouping] = 'B' Then S.Amountsold Else 0 End) as decimal(18,2)))*100) as CombinedVar

INTO #Var    

From #Split as S

Group by S.StoreID

If Exists (Select * From #Var Where (OrderVar < 1 and AmountSoldVar <1) Or CombinedVar < 2)
If Object_ID('tempdb.dbo.#TestA')       IS NOT NULL     DROP TABLE #TestA

Select
 A.StoreID 
,A.CustomerID
,A.Orders
,A.AmountSold
,A.Random
,A.Ranking
,V.CombinedVar

Into #TestA

From #OrderRanking as A
    Join #var as V
        on A.StoreID = V.StoreID

Where A.StoreID in 
(Select StoreID From #Var Where (OrderVar < 1 and AmountSoldVar <1) Or CombinedVar < 2)


Insert Into #TestB

Select
 A.StoreID 
,A.CustomerID
,A.Orders
,A.AmountSold
,A.Random
,A.Ranking
,A.CombinedVar


From #TestA as A
    Left Join #TestB as B
        on      A.CustomerID = B.CustomerID

Where
        B.CustomerID is null


Insert Into #BestPrep

Select
 A.StoreID 
,A.CustomerID
,A.Orders
,A.AmountSold
,A.Random
,A.Ranking
,V.CombinedVar

From #OrderRanking as A
    Join #Var as V
        on A.StoreID = V.StoreID
    Left Join #BestPrep as B
        on      A.CustomerID = B.CustomerID
            and V.CombinedVar > B.CombinedVar

Where
        B.CustomerID is null


Set @Giveup = @Giveup-1



END




If Object_ID('tempdb.dbo.#bestPrep2')       IS NOT NULL DROP TABLE #bestPrep2
Select
     A.StoreID 
    ,Min(CombinedVar) as CombinedVar

Into #BestPrep2

From #BestPrep as A

Group by
     A.StoreID 

Select A.*

From #BestPrep as A
    Join #BestPrep2 as B
        on      A.StoreID = B.StoreID
            and A.CombinedVar = B.CombinedVar
Union
Select * From #TestB

Even splitting of a group to within 1%

1 Answers1