Group By, Count and Delete on Consecutive Records

Question

I have a tricky SQL question. This is based on SQL server 2008 R2.

From a Log table, I have to combine consecutive records which have the same messages (MSG), count how many messages are combined (COUNT), and then delete the duplicate messages. This also needs to be done within a date range, so that any records outside of that range is left alone.

To make this more understandable, here is a small example of the data:

ID  DATE       MSG  COUNT
1   2013-08-17 mail NULL
2   2013-08-17 mail NULL
3   2013-08-17 www  NULL
4   2013-08-18 www  NULL
5   2013-08-18 www  NULL
6   2013-08-18 www  NULL
7   2013-08-18 mail NULL
8   2013-08-18 www  NULL
9   2013-08-19 mail NULL
10  2013-08-19 mail NULL
11  2013-08-20 mail NULL
12  2013-08-20 mail NULL
13  2013-08-21 www  NULL
14  2013-08-22 mail NULL
15  2013-08-22 mail NULL
16  2013-08-23 mail NULL
17  2013-08-23 mail NULL
18  2013-08-23 mail NULL

The result should look like the followng:

ID  DATE       MSG  COUNT
1   2013-08-17 mail NULL
2   2013-08-17 mail NULL
3   2013-08-17 www  NULL
6   2013-08-18 www  3
7   2013-08-18 mail 1
8   2013-08-18 www  1
12  2013-08-20 mail 4
13  2013-08-21 www  1
15  2013-08-22 mail 2
16  2013-08-23 mail NULL
17  2013-08-23 mail NULL
18  2013-08-23 mail NULL

So, basically, the query should

handle data only within a given date range (in this example from 2013-08-18 to 2013-08-22)
combine consecutive rows based on the text of the MSG field
count the combined data and set the value in the COUNT field
delete the duplicate records (in this example e.g ID 6 stays, but ID 5 and ID 4 should be deleted)

As I am not an expert in SQL, I would really appreciate any help, suggestions or SQL queries.

Can you make a SQLFiddle? It would be much more easier for us to help you. — Minoru, Sep 17 '13 at 13:08
Are you looking for consecutive records or records on the same date, since in the query above it seems the records (MSG column) are grouped on the basis of date. — Sunil Tandon, Sep 17 '13 at 13:14
I am looking for consecutive records with same MSG (Message), not Date. Sorry for the confusion — padre, Sep 17 '13 at 13:19
Which version of SQL Server are you running. SQL Server 2010 is not a known version — podiluska, Sep 17 '13 at 13:21

Minoru · Answer 1 · 2013-09-18T14:28:30.663

1

My idea is to do it with 2 queries:

(i) The first one was to only count and update the records.

(ii) The second one was to delete all the records given the date range that had a NULL value on COUNT column.

EDIT: I did the step (i), but I couldn't make it keep the COUNT value NULL for the ones to be deleted. It updates all rows with the COUNT. Now you just have to DELETE the right rows.

Step (i):

(For MySQL)

UPDATE tab ta JOIN 
    (SELECT date, msg, COUNT(*) AS cnt FROM tab GROUP BY date, msg) tb 
SET ta.count = tb.cnt 
WHERE ta.date = tb.date AND ta.msg = tb.msg AND 
ta.date BETWEEN 
    DATE('2013-08-18') AND DATE('2013-08-21');

PS: The syntax for DATE I used is for MySQL, you might adapt it for MS SQL Server.

(For MS SQL Server)

UPDATE ta 
SET ta.count = tb.cnt 
FROM tab ta, 
     (SELECT date, msg, COUNT(*) AS cnt FROM tab GROUP BY date, msg) tb 
WHERE ta.date = tb.date AND ta.msg = tb.msg AND ta.date 
BETWEEN CAST('2013-08-18' AS DATE) AND CAST('2013-08-20' AS DATE);

edited Sep 18 '13 at 14:28

answered Sep 17 '13 at 13:07

Minoru

1,680
3
20
44

I don't think that this answers the question. – Jerry Sep 17 '13 at 13:09
He also asked for **suggestions**. So... I think this is acceptable. – Minoru Sep 17 '13 at 13:11
if you could help me with SQL code for (i), I could handle (ii) – padre Sep 17 '13 at 13:21
@padre I'm trying it. – Minoru Sep 17 '13 at 13:23
@padre Check out the step *(i)*. If you have problems with step *(ii)*, just ask here and I'll try to help you later. – Minoru Sep 17 '13 at 15:32
@Lucas: Hi Lucas, the query does not give the wanted results. I think the first example was a bit confusing due to the DATE field. I posted another example, would preciate if you can come up with something for the second example. Thanks – padre Sep 18 '13 at 14:02
@padre What does my query return to you? It is supposed to update all entries with the count and just later, remove the combined records. – Minoru Sep 18 '13 at 14:04
I just checked here, the `TO_DATE` function is not working, check the edited. – Minoru Sep 18 '13 at 14:09
let me check once agains – padre Sep 18 '13 at 14:20
@padre I edited once again for the MS SQL Server syntax. Try the new one. – Minoru Sep 18 '13 at 14:27
1

it counts ALL records with same MSG within the give date range, but it should count only CONSECUTIVE records with same MSG. I have posted another example to avoid confusion. Basically, if row1,row2 has "www" as MSG, and then row5 has "www" as MSG, it should set the count to 2 for row1 and row2, and the count to 1 for row5. In your example the count is set to 3 for row1, row2 and row5 – padre Sep 18 '13 at 14:54
Oh, now I get it. Sorry for misunderstanding it. I'll think again. – Minoru Sep 18 '13 at 15:27
Hi Lucas, any suggestions? – padre Sep 19 '13 at 06:41
I don't see a way to do it with SQL queries. You will have to use Procedural Language. – Minoru Sep 19 '13 at 12:27
yes, I was thinking the same, but I was hoping that I was missing something. Thanks – padre Sep 19 '13 at 14:09

AdamL · Answer 2 · 2013-09-17T13:55:00.523

1

Try this:

DROP TABLE #temp 
GO
select
    * 
into #temp
from (
    select '1' as id,'2013-08-17' as [date], 'mail' as msg,'NULL' as [count] union all
    select '2','2013-08-17','mail','NULL' union all
    select '3','2013-08-17','www','NULL' union all
    select '4','2013-08-18','www','NULL' union all
    select '5','2013-08-18','www','NULL' union all
    select '6','2013-08-18','www','NULL' union all
    select '7','2013-08-18','mail','NULL' union all
    select '8','2013-08-18','www','NULL' union all
    select '9','2013-08-19','mail','NULL' union all
    select '10','2013-08-19','mail','NULL' union all
    select '11','2013-08-20','mail','NULL' union all
    select '12','2013-08-20','mail','NULL' union all
    select '13','2013-08-21','www','NULL' union all
    select '14','2013-08-22','mail','NULL' union all
    select '15','2013-08-22','mail','NULL' union all
    select '16','2013-08-23','mail','NULL' union all
    select '17','2013-08-23','mail','NULL' union all
    select '18','2013-08-23','mail','NULL'
) x
GO


select 
    t.*,
    rwn
from #temp t
join (
    select 
        id, [date], [msg], [rwn] = row_number() over(partition by [date], [msg] order by id )
    from #temp
    where 1=1
        and [date] between '2013-08-18' and '2013-08-22'
) x
    on t.id=x.id
 order by 
    t.date, t.msg

Just modify it for UPDATE and then delete all rows where rwn>1

EDIT: Your data type is probably text, so you get sort/comparison the errors. Do you really need text? It is a large object data type (blob), which can store several GB of text. Try changing this to varchar(8000) for example, or if these are really that big messages, varchar(max) will do, too

edited Sep 17 '13 at 13:55

answered Sep 17 '13 at 13:11

AdamL

12,421
5
50
74

I have this error: The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator – padre Sep 17 '13 at 13:21
It complains about: (partition by DATUM,MSG order by id). If I remove the MSG from the Query, then I dont get any errors, but also no results – padre Sep 17 '13 at 13:28
MSG is probably defined as `text` which you cannot sort or group on. Try casting it to `varchar(max)` like: `cast(msg as varchar(max))` – Dan Bellandi Sep 17 '13 at 13:40
no, my mistake ... it produce results, but the rwn is always set to 1 – padre Sep 17 '13 at 13:48
@padre Please see the edits. I explained the error, and posted runnable sql - works fine, so I don't know why you have rwn=1 – AdamL Sep 17 '13 at 13:56
I had to change the dates, however, the comparison should be performed only on the MSG field. I als edit the entry. – padre Sep 17 '13 at 14:13
I also posted another example – padre Sep 17 '13 at 14:19

Angelo · Answer 3 · 2013-09-18T06:27:19.353

Hi please try this hope it helps you, The way i understand is u need to group and remove duplicate and retain 1 only. sorry about my english

DECLARE @Table_2 TABLE (ID INT, [DATE] date, MSG Varchar(50), [COUNT] int)
Declare @fromDate as date = '2013-08-18'
Declare @toDate as date = '2013-08-22'

INSERT INTO @Table_2 (ID, [DATE], MSG, [COUNT])
SELECT     MAX(DISTINCT ID) AS ID, DATE, MSG, COUNT(DATE) AS COUNT
FROM         dbo.Table_1
where [DATE] between @fromDate and @toDate
GROUP BY DATE, MSG



UPDATE Table_1 
SET [COUNT] = T2.COUNT 

FROM Table_1 AS T1 INNER JOIN
@Table_2 AS T2
ON T1.ID = T2.ID

WHERE T1.ID = T2.ID


DELETE T1
FROM Table_1 AS T1
FULL OUTER JOIN @Table_2 AS T2 
ON T1.DATE = T2.DATE AND T1.MSG = T2.MSG 

WHERE (T1.DATE = T2.DATE AND T1.MSG = T2.MSG) AND T1.ID != T2.ID

Group By, Count and Delete on Consecutive Records

3 Answers3