SQL: aggregate sizes into bins

Question

I have a (Sybase) table that has the following information:

 order_id   int   
 timestamp  datetime   
 action     char(1)      --i=inserted, c=corrected, r=removed   
 shares      int

It keeps track of the shares associated with an order (identified by its order_id) in a system. Using an example, the life of an order is as follows:

  timestamp action  shares      
  10:00:00  i       1000     -- initial Insert    
  10:06:30  c       900      -- one Change    
  10:07:12  c       800    
  10:50:20  r       800      -- Removal    
  11:10:10  i       600      -- 2nd Insert    
  11:12:10  r       600

In the example above, the order is active from 10:00:00 and 10:50:20, and again from 11:10:10 and 11:12:10

I have 1000s of such orders in the system and I need to plot with a histogram how many shares are active in a time series divided into bins/buckets of 5 minutes. If the number of shares for a given order changes more than once within the samenter code heree bin, I need to average the shares; as in the example above in the 10:05-10:10 bin where 1000, 900 and 800 can be avg'd out as 900.

Here's a more complex example:

1, "20140828 10:00:00",  "i", 1000
1, "20140828 10:06:00",  "c",  900
1, "20140828 10:07:12",  "c",  500
1, "20140828 10:10:10",  "c",  400
1, "20140828 10:20:20",  "r",  400
1, "20140828 10:30:10",  "i",  300
1, "20140828 10:32:10",  "r",  300

2, "20140828 09:51:00",  "i",  500
2, "20140828 10:08:30",  "r",  500

3, "20140828 10:10:00",  "i", 1000
3, "20140828 10:11:20",  "r", 1000

with its expected output:

10:00:00 1500
10:05:00 1300
10:10:00 1450
10:15:00 400
10:20:00 400
10:25:00 0
10:30:00 300
10:35:00 0
10:40:00 0
10:45:00 0
10:50:00 0
10:55:00 0

Thanks in advance for your help.

Can you post your expecting output result of your example input data? — Jaugar Chang, Aug 28 '14 at 03:39
Thanks for the suggestion, @JaugarChang. I changed my posting a bit and added a more comprehensive example and expected output. — Jerome Provensal, Aug 28 '14 at 19:40

Pieter Geerkens · Accepted Answer · 2014-08-28T03:36:44.947

This is a variation on the Running Sum problem in SQL Server (either MS or Sybase, due to shared history), grouped by the bucket ID, which can simply be the time difference in minutes from a base time integer-divided by 5. So something like this will do:

create table #t(
    BucketNo    int not null primary key clustered,
    Activity    int not null,
    Active      int not null
);

-- pre-aggregate activity data
-- assumes prior existence of a zero-based NUMBERS or TALLY table
insert #t(BucketNo,Activity,Active) 
select 
     N
    ,isnull(Activity,0)
    ,0
from NUMBERS 
left join (
    select
         datediff(mm,0,TimeStamp) / 5 as BucketNo
        ,case action when 'i' then +1
                          'r' then -1
         end * shares          as Activity
        ,0 as Active  
    from  ActivityTable
    where action <> 'c'
    group by            datediff(mm,0,TimeStamp) / 5 

    union all

    select
         datediff(mm,0,TimeStamp) / 5 as BucketNo
        ,case action when 'i' then +1
                          'r' then -1
         end * shares
         - (  select top 1 i.shares 
              from ActivityTable i
              where i.order_id = c.order_id  and  i.TimeStamp > c.TimeStamp
              order by i.TimeStamp desc
           ) as Activity
        ,0 as Active  
    from ActivityTable as c
    where c.action  = 'c         
    group by            datediff(mm,0,TimeStamp) / 5 
) data on data.BucketNo = N
where N < 24 * 12; -- 5 minute buckets per day

Now we use the SQL Server quirky update to process #t in clustered-index order to perform the Running Sum.

declare @Shares   int = 0,
        @BucketNo int = 0;

-- `quirky update` peculiar to SQL Server
update #t
   set @Shares = Shares 
               = case when BucketNo = @BucketNo
                      then @Shares + Activity
                      else 0
                 end,
       @BucketNo = BucketNo
from #t with (TABLOCKX) -- not strictly necessary when using a temp table.
option (MAXDOP 1);      -- prevent parallelization of query

select BucketNo, Active from #t order by BucketNo
go

Thanks Pieter, I'm trying your solution (with some modification for Sybase) but I have 3 questions: 1) what's that NUMBERS/TALLY table that you are talking about, 2) I'm getting a syntax error that I can't resolve around the `select top 1 i.shares` line: `Incorrect syntax near the keyword 'top'. Msg 102, Level 15, State 181`, 3) will the quirky update work in Sybase? — Jerome Provensal, Aug 28 '14 at 15:12
@jeromeso: **Numbers table**: http://dba.stackexchange.com/questions/11506/why-are-numbers-tables-invaluable and http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/you-require-a-numbers-table.aspx. Yes, quirky update should still work in SYBASE, as it pre-dates the separation of SYBASE and MS flavours of SQL Server. re the syntax error, try **FIRST** instead of **TOP 1** (as per http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc00801.1510/html/iqrefso/X315771.htm) — Pieter Geerkens, Aug 28 '14 at 20:03
After playing with the SQL solution I realized that it was slower than my Perl implementation and so I continued with Perl. Thanks for the suggestion, @PieterGeedkens! I learn quite a bit. — Jerome Provensal, Aug 29 '14 at 18:06

SQL: aggregate sizes into bins

1 Answers1