3

I built a query that finds the longest common substrings of a column and orders them by frequency. The problem I'm having is removing/grouping similar results.

Here's the TOP 5 output from the code below - note how "I love mittens the cat" is the longest, most frequent string, but the code also finds all subsidiaries of that string such as "I love mittens the ca" or "I love mittens the c".

    I love Mittens the cat  3
    I love Mittens the ca   3
     love Mittens the cat   3
     love Mittens the ca    3
    I love Mittens the c    3

If possible, I want to remove any substrings that are similar to others that have partial words. Row 3 would be fine since it's all full words, but rows 4 and 5 should be removed since they're similar to row 1.

DECLARE     @MinLength INT          = 5     --Minimum Substring Length
DECLARE     @MaxLength INT          = 50    --Maximum Substring Length
DECLARE     @Delimeter VARCHAR(5)   = ' '
DECLARE     @T TABLE
            (
                  ID INT IDENTITY
                , chvStrings VARCHAR(64)  
            )
INSERT INTO @T VALUES
            ('I like cats'),
            ('I like dogs'),
            ('cats are great'),
            ('look at that cat'),
            ('I love Mittens the cat'),
            ('I love Mittens the cat a lot'),
            ('I love Mittens the cat so much'),
            ('Dogs are okay, I guess...')

SELECT TOP 10000 
    SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
    COUNT(M.number) AS [Count]
FROM        
    @T as T
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE       
    N.number <= LEN(t.chvStrings) - M.number + 1
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR  N.number = 1)
GROUP BY  
    SUBSTRING(T.chvStrings, N.Number, M.Number)                      
ORDER BY    
    COUNT(T.chvStrings) DESC,
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC 
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Fubudis
  • 241
  • 3
  • 6
  • 17
  • 1
    Unless you have table of ALL words how would you know what combinations of letters makes a complete word? – Sean Lange Jun 08 '16 at 21:14
  • 2
    Such a great job posting sample data and making it easy for people to help. I just wish I knew what you are trying to do here. It is not clear to me what you are really trying to accomplish. – Sean Lange Jun 08 '16 at 21:15
  • I can make a table of all words with a delimiter function probably but that seems like it'd be pretty taxing. To elaborate a bit, let's assume I work for Staples and we want to see what people are writing on their business cards they print. We have a DB field that stores all the text customers have on their cards as a single string. I'm interested in finding out what the most popular themes for business cards are. This query was an attempt at finding that information, the problem is you get a lot of extremely similar strings (like the example) and its too much to sort through. – Fubudis Jun 08 '16 at 21:23
  • If you can just explain what you are trying to do more clearly we can probably find a way to make it happen. – Sean Lange Jun 08 '16 at 21:26
  • To elaborate a bit, let's assume I work for Staples and we want to see what people are writing on their business cards they print. We have a DB field that stores all the text customers have on their cards as a single string. I'm interested in finding out what the most popular themes for business cards are. This query was an attempt at finding that information, the problem is you get a lot of extremely similar strings (like the example) and its too much to sort through. – Fubudis Jun 08 '16 at 21:42
  • @Fubudis, why row two is fine? it contains the incomplete word (ca) – FLICKER Jun 08 '16 at 23:12
  • @FLICKER sorry meant row 3 good catch – Fubudis Jun 08 '16 at 23:51
  • So you just want substrings that contain full words then. I mean that's the point right? – Dresden Jun 08 '16 at 23:59
  • Just substring your sentences based on space character as a delimiter, not full characters. – Jorge Campos Jun 09 '16 at 00:01
  • What is your sql server version? – Jorge Campos Jun 09 '16 at 00:02
  • @JorgeCampos MS SQL Server 10.50.6000.34 – Fubudis Jun 09 '16 at 01:12
  • @Dresden That's right – Fubudis Jun 09 '16 at 01:14

1 Answers1

1

I have added a couple of extra filters to say that the substring N.Number-1 must not contain letters [a-z0-9], and similarly substring M.Number+1 must not be [a-z0-9].

Is this what you need. Modified code below:

DECLARE     @MinLength INT          = 5     --Minimum Substring Length
DECLARE     @MaxLength INT          = 50    --Maximum Substring Length
DECLARE     @Delimeter VARCHAR(5)   = ' '
DECLARE     @T TABLE
            (
                  ID INT IDENTITY
                , chvStrings VARCHAR(64)  
            )
INSERT INTO @T VALUES
            ('I like cats'),
            ('I like dogs'),
            ('cats are great'),
            ('look at that cat'),
            ('I love Mittens the cat'),
            ('I love Mittens the cat a lot'),
            ('I love Mittens the cat so much'),
            ('Dogs are okay, I guess...')

SELECT TOP 10000 
    SUBSTRING(T.chvStrings, N.Number,  M.Number) AS Word,
    COUNT(M.number) AS [Count]
    --SUBSTRING(T.chvStrings,M.Number+1,1)
FROM        
    @T as T
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE       
    N.number <= LEN(t.chvStrings) - M.number + 1
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR  N.number = 1) 
    AND SUBSTRING(T.chvStrings,M.Number+1,1) NOT LIKE '%[a-z0-9]%'
    AND SUBSTRING(T.chvStrings,N.Number-1,1) NOT LIKE '%[a-z0-9]%'
GROUP BY  
    SUBSTRING(T.chvStrings, N.Number, M.Number)                      
ORDER BY    
    COUNT(T.chvStrings) DESC,
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC 
Thomas Steven
  • 449
  • 5
  • 13