5

I have a table with a series of (IP varchar(15), DateTime datetime2) values. Each row corresponds to an HTTP request made by a user. I want to assign session numbers to these rows. Different IP-addresses have different session numbers. The same IP should be assigned a new session number if the last request is older than 30min. Here is a sample output:

IP,      DateTime,         SessionNumber, RequestNumber
1.1.1.1, 2012-01-01 00:01, 1,             1
1.1.1.1, 2012-01-01 00:02, 1,             2
1.1.1.1, 2012-01-01 00:03, 1,             3
1.1.1.2, 2012-01-01 00:04, 2,             1 --different IP => new session number
1.1.1.2, 2012-01-01 00:05, 2,             2
1.1.1.2, 2012-01-01 00:40, 3,             1 --same IP, but last request 35min ago (> 30min)

Columns 1 and 2 are inputs, 3 and 4 are the desired outputs. The table shows two users.

As the underlying is table is truely large, how can this be solved efficiently? I'd prefer a small constant amount of passes over the data (one or two).

Troy Alford
  • 26,660
  • 10
  • 64
  • 82
usr
  • 168,620
  • 35
  • 240
  • 369

2 Answers2

8

Here's a couple of tries at it.

;WITH CTE1 AS
(
SELECT *,
IIF(DATEDIFF(MINUTE,
       LAG(DateTime) OVER (PARTITION BY IP ORDER BY DateTime),
       DateTime) < 30,0,1) AS SessionFlag
FROM Sessions
), CTE2 AS
(
SELECT *,
       SUM(SessionFlag) OVER (PARTITION BY IP 
                                  ORDER BY DateTime) AS IPSessionNumber
FROM CTE1
)
SELECT IP,
       DateTime,
       DENSE_RANK() OVER (ORDER BY IP, IPSessionNumber) AS SessionNumber,
       ROW_NUMBER() OVER (PARTITION BY IP, IPSessionNumber 
                              ORDER BY DateTime) AS RequestNumber
FROM CTE2

This has two sort operations (by IP, DateTime then by IP, IPSessionNumber) but does assume that the SessionNumber can be assigned arbitrarily as long as a different unique session number is assigned to each new session per the ip address/ 30 minute rule.

To assign the SessionNumbers sequentially in chronological order. I used the following.

;WITH CTE1 AS
(
SELECT *,
IIF(DATEDIFF(MINUTE,
       LAG(DateTime) OVER (PARTITION BY IP ORDER BY DateTime),
       DateTime) < 30,0,1) AS SessionFlag
FROM Sessions
), CTE2 AS(
SELECT *,
       SUM(SessionFlag) OVER (ORDER BY DateTime) AS GlobalSessionNo
FROM CTE1
), CTE3 AS(
SELECT *,
       MAX(CASE WHEN SessionFlag = 1 THEN GlobalSessionNo END) 
               OVER (PARTITION BY IP ORDER BY DateTime) AS SessionNumber
FROM CTE2)
SELECT IP,
       DateTime,
       SessionNumber,
       ROW_NUMBER() OVER (PARTITION BY SessionNumber 
                              ORDER BY DateTime) AS RequestNumber
FROM CTE3

This increases the number of sort operations to 4 however.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • If requests from two IP's interleave, won't their sessions will mix up? – Andomar Apr 08 '12 at 19:17
  • I must say that T-SQL is lacking a custom, stateful aggregate feature. That would make this query easy and require only a single sort operation. – usr Apr 08 '12 at 22:33
  • 2
    @usr - Might be worth evaluating a cursor and a `#temp` table also! not sure if this is somewhere where [OVER clause enhancement request - Progressive ordered calculations](http://connect.microsoft.com/SQLServer/feedback/details/254397/over-clause-enhancement-request-progressive-ordered-calculations) would have helped. – Martin Smith Apr 08 '12 at 22:51
  • I actually did a dynamic-cursor solution while hoping that you would post a single-query solution (which you did!). It is 20x slower. Maybe I'll still use it because I can extend it better with more sophisticated calculations. The fundamental operation of what we are doing here is a "scan" (e.g. http://msdn.microsoft.com/en-us/library/hh211665(VS.103).aspx). scan = arbitrary aggregation over an ordered sequence while outputting one element per input element instead of one element for the whole group/sequence. – usr Apr 08 '12 at 23:00
2

Here is a version using a table variable and row_number to create a ID that can be used in a recursive CTE. It might be worthwhile to compare performance against the cursor and one query (provided by Martin) versions.

CREATE TABLE #T
(
  IP varchar(15),
  DateTime datetime,
  ID int,
  primary key (IP, ID)
)

insert into #T(IP, DateTime, ID)
select IP, DateTime, row_number() over(partition by IP order by DateTime) 
from #sessionRequests

;with C as
(
  select IP,
         ID,
         DateTime,
         1 as Session
  from #T
  where ID = 1
  union all 
  select T.IP,
         T.ID,
         T.DateTime,
         C.Session + case when datediff(minute, C.DateTime, T.DateTime) >= 30 then 1 else 0 end
  from #T as T
    inner join C 
      on T.IP = C.IP and
         T.ID = C.ID + 1
)
SELECT IP,
       DateTime,
       dense_rank() over(order by IP, Session) as SessionNumber,
       row_number() over(partition by IP, Session order by DateTime) as RequestNumber
from C
order by IP, DateTime, SessionNumber, RequestNumber
option (maxrecursion 0)
usr
  • 168,620
  • 35
  • 240
  • 369
Mikael Eriksson
  • 136,425
  • 22
  • 210
  • 281
  • 1
    I like this version because it is easy to extend, almost like the cursor based approach. I changed it to use a temp table which fixed an optimizer issue (table variables have no stats). Also, I verified, that this code works. Thanks! – usr Apr 09 '12 at 16:47