2

I need to find frequency of sequences of data rows.I have like 17000 rows data which is including almost 120 different kind of data and I need to find which sequences of data how many times repeated?

For instance:

a
b
c
a
b
d
a
b
c

I am trying to find repetition order some people call it frequency sequence. So how many times aa and abc and ab and bc and abca and so on are there in this column? I mean I need to find how many times this data has same group of rows.

For this example it has 4 different data so there are a lot of combination of it. For calculation: C(4,1)*4!+C(4,2)*2!+C(4,3)*3!+C(4,4) different order and I need to calculate it for each order how many times are there?

My short part of real column data example:(each contiguous data is equal to a row)

3E010000 
2010000
2010007
2010008
2010000
2010003
2010009
0201000A 
0B01000C 
2010002
3E010000 
2010000
2010007
0B010014 
2010009
0201000A 
0B01000C 
2010002

Now if you could check this whole main column for this group of data:

3E010000 
2010000
2010007

and this

3E010000 
2010000

and this

2010009
0201000A 
0B01000C 
2010002

and so on. You can see they were repeated more than once.

These rows repeat in the main first column and I am trying to find 1,2,3,4 and max 5 groups of orders out of 120 kind of data combination.

I am using Microsoft SQL Server 2014. But if it is not possible in Microsoft SQL Server, then you could give me any advice or other tools. Could you help me please? Thank you so much!

Output:

0B010009 ,0B010009,0B010009,2010005,2010005,2010005     2   9

0B010014 ,0B010014,0B010014,16010002,16010002,16010002     2    3    

2010002,2010002,0201FFE0,0201FFE0       2   13    

0B0114B5 ,0B0114B5,0B0114B5,2010002,2010002,2010002,2010004,2010004,2010004    3    3    

070105B3 ,070105B3,070105B3,2010005,2010005,2010005,0201FFE1 ,0201FFE1,0201FFE1 
   3    2    

3E010000 ,3E010000,3E010000,0B010010,0B010010,0B010010 ,0B01F61D ,0B01F61D,0B01F61D     3   6    

3E010002 ,3E010002,3E010002,0B010013,0B010013,0B010013 ,0B01F80D ,0B01F80D,0B01F80D    3    3    

0B010003 ,0B010003,2010006,2010006,0B01000A ,0B01000A,2010005,2010005   4   2    

0B01FFE1 ,0B01FFE1,0B01FFE1,0B010013,0B010013,0B010013 ,0B01EAD0 ,0B01EAD0,0B01EAD0,0B010004,0B010004,0B010004     4    4     

0B01000C ,0B01000C,0B01000C,0B01FCBD,0B01FCBD,0B01FCBD ,0701FFE0 ,0701FFE0,0701FFE0,0B01000A,0B01000A,0B01000A     4    5  
Alex
  • 4,885
  • 3
  • 19
  • 39
gn khrmn
  • 23
  • 1
  • 6
  • 2
    SQL tables represent unordered sets. There are no sequences unless a column specifies the ordering. – Gordon Linoff Nov 15 '17 at 02:17
  • 2
    @ganikhrmn, SO is not a forum, please Edit your original question to include additional information. – Alex Nov 15 '17 at 02:40
  • l am a new user in SO and sorry about it. I have just edited it. Thank you for informing. l hope it is more clear now. – gn khrmn Nov 15 '17 at 03:06
  • How is the value order defined? Do you have e.g. an ID column? – Alex Nov 15 '17 at 03:30
  • I believe this solves half of you problem, namely how to search for known patterns: https://stackoverflow.com/questions/31703765/how-do-you-identify-record-pattern-sequences-in-records-using-tsql . – Alex Nov 15 '17 at 03:34
  • Thank you so much for answering. Actually there is no id order l just imported this column from excel, but l can give an order of course. And actually, this solution is maybe possible but for small data. Because in my problem it has a huge combination. Almost 120 different kind of data and 17000 rows. When we calculate combination of it for C(120,1)+C(120,2)+C(120,3)+C(120,4)+C(120,5) like that it is a lot but thanks again l will search it with these key words. – gn khrmn Nov 15 '17 at 03:52
  • You must give it an ID, as @GordonLinoff said SQL Server stores data in unordered sets. If yo have no ID, then there is no relationship between rows. – Alex Nov 15 '17 at 03:56

1 Answers1

1

Below query finds duplicate patterns for 2, 3, 4 and 5 repeating rows.

It uses 'LEAD' and 'HASHBYTES' functions.

Query works by computing a hash sequence for values in current row + following rows and then grouping on these hash values to find "duplicate" patterns. This process is done for each row.

Note: an ever increasing sequence column (to denote row position) i.e. ID is assumed.

CREATE TABLE #Data( ID INT IDENTITY PRIMARY KEY, Val VARCHAR( 20 ))
INSERT INTO #Data
VALUES
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '2010008' ), ( '2010000' ),
( '2010003' ), ( '2010009' ), ( '0201000A' ), ( '0B01000C' ), ( '2010002' ),
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '0B010014' ), ( '2010009' ),
( '0201000A' ), ( '0B01000C' ), ( '2010002' )


SELECT Pat3Rows, COUNT(*) AS Cnt
FROM(
    SELECT *,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
    FROM #Data AS D1
) AS HashedGroups
GROUP BY Pat3Rows
HAVING COUNT(*) > 1

Note: there is a possibility, albeit extremely remote, of encountering hash collisions, so the above logic is not guaranteed to handle all theoretically possible cases. In summary I would not recommend using it if someone's life depends on the procedure to always be 100% accurate.

You did not specify how the output should look like so I will leave this up to you.

I have also tested this on my laptop with 18,000 rows and it produces a result in less than 1 second.

Sample use case:

;WITH DataHashed AS(
SELECT *,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' +  LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' +  LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + ',' +  LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
FROM #Data ),
RepeatingPatterns AS(
    SELECT MIN( ID ) AS FirstRow, Pat2Rows AS PatternHash, 2 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat2Rows HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN( ID ) AS FirstRow, Pat3Rows, 3 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat3Rows HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN( ID ) AS FirstRow, Pat4Rows, 4 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat4Rows HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN( ID ) AS FirstRow, Pat5Rows, 5 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat5Rows HAVING COUNT(*) > 1
)
--SELECT * FROM RepeatingPatterns
SELECT 
     CONVERT( VARCHAR( 50 ), SUBSTRING(
        ( SELECT ',' + D.Val  AS [text()]
        FROM #Data AS D
        WHERE RP.FirstRow <= D.ID AND D.ID < ( RP.FirstRow + RP.PatternSize )
        ORDER BY D.ID
        FOR XML PATH ('')
        ), 2, 1000 )) AS Pattern, CONVERT( VARCHAR( 35 ), PatternHash, 1 ) AS PatternHash, RP.PatternSize, Cnt
FROM RepeatingPatterns AS RP

Sample output:

Pattern                                            PatternHash                         PatternSize Cnt
-------------------------------------------------- ----------------------------------- ----------- -----------
0201000A,0B01000C                                  0x499D8B1750A9BF57795B4D60D58DCF81  2           2
2010000,2010007                                    0x7EDE1E675D934F3035DACAC53F74DD14  2           2
3E010000,2010000                                   0x85FBFD817CFBB9BD08E983671EB594B7  2           2
2010009,0201000A                                   0x8E18E36B989BD859AF039238711A7F8C  2           2
0B01000C,2010002                                   0xF1EABB115FB3AEF2D162FB3EC7B6AFDA  2           2
0201000A,0B01000C,2010002                          0x6DE203B38A13501881610133C1EDBF85  3           2
2010009,0201000A,0B01000C                          0x9EB3ACFE8580A39FC530C7CA54830602  3           2
3E010000,2010000,2010007                           0xE414661F54C985B7ED9FA82FF05C1219  3           2
2010009,0201000A,0B01000C,2010002                  0x7FCDB748E37A6F6299AE8B269A4B0E49  4           2
Alex
  • 4,885
  • 3
  • 19
  • 39
  • Thank you so much. l have just recieved a mistake in output first column pattern not show correctly instead of (,) it gives me weird character like but other is working correctly. Thanks again for helping. – gn khrmn Nov 15 '17 at 04:58
  • It must be a collation problem, try changing column type to `NVARCHAR` e.g. `CONVERT( VARCHAR( 50 )` -> `CONVERT( NVARCHAR( 50 )`. Also make sure you SQL Editor supports unicode. – Alex Nov 15 '17 at 04:59
  • Thank you so much. l am new in here as l said before and are there any places to give you grade 5 star something like that? You really deserve it. Awesome job. Thanks again. – gn khrmn Nov 15 '17 at 05:16
  • You can also upvote my answer by clicking on the "up" triangle just above the answer score :) – Alex Nov 15 '17 at 05:17
  • May l ask a question? When l apply it to whole data, lt shows something like that output and it looks like wrong. l wrote part of output end of my question. Especially first pattern column not normal shows more than group and instead of 2 different data give me 3 always 1 more. Could you check it again? – gn khrmn Nov 15 '17 at 07:51
  • @ganikhrmn, this is embarrassing, I got the `WHERE D.ID BETWEEN RP.FirstRow AND RP.FirstRow + RP.PatternSize` condition completely wrong. I have replaced it with: `WHERE RP.FirstRow <= D.ID AND D.ID < ( RP.FirstRow + RP.PatternSize )`. I have updated "Sample Use Case" and "Sample output" in my answer. – Alex Nov 15 '17 at 09:37
  • Now it is working correctly but if you check my output again there is a repetition in pattern column. l tried to figure out but could not be successful. Could you check my output again please? Thanks. @Alex – gn khrmn Nov 15 '17 at 17:27
  • I don't know what the problem may be. Is your query the same as what I provided? How do you assign IDs in the table? Add `FirstRow` column to the last select statement to find out the starting ID for a given pattern, then search your `#Data` table for given ID to see if you may have any duplicates – Alex Nov 15 '17 at 20:51
  • Note that some of the values contain a trailing blank space. I suggest that you trim it. I have also updated my answer. – Alex Nov 15 '17 at 21:25
  • Thank you so much for helping l really appreciate it. But still l could not fix it also l calculate some group of it ,for example, there are one pattern more than 4 times, but output say 2 times. ( for this pattern 0B01000C,2010006 ) Would you like to see my original data which is in excel? How can l send it to you? – gn khrmn Nov 15 '17 at 21:34
  • You have trailing blanks in data, you need to cleanse your data first. Run and update with `RTRIM` function on your values. Then if your still have odd things happening, have a look good look at the data, pay attention to trailing blanks or other characters that may look similar but are different. – Alex Nov 15 '17 at 21:36
  • Thank you so much yes now everything looks like correct. I am just checking randomly some pattern and count them. You are really an expert for sql. Awesome. – gn khrmn Nov 15 '17 at 22:39
  • After more than a year you want to know the answer? – Alex Mar 24 '19 at 22:02