I built a query that finds the longest common substrings of a column and orders them by frequency. The problem I'm having is removing/grouping similar results.
Here's the TOP 5 output from the code below - note how "I love mittens the cat" is the longest, most frequent string, but the code also finds all subsidiaries of that string such as "I love mittens the ca" or "I love mittens the c".
I love Mittens the cat 3
I love Mittens the ca 3
love Mittens the cat 3
love Mittens the ca 3
I love Mittens the c 3
If possible, I want to remove any substrings that are similar to others that have partial words. Row 3 would be fine since it's all full words, but rows 4 and 5 should be removed since they're similar to row 1.
DECLARE @MinLength INT = 5 --Minimum Substring Length
DECLARE @MaxLength INT = 50 --Maximum Substring Length
DECLARE @Delimeter VARCHAR(5) = ' '
DECLARE @T TABLE
(
ID INT IDENTITY
, chvStrings VARCHAR(64)
)
INSERT INTO @T VALUES
('I like cats'),
('I like dogs'),
('cats are great'),
('look at that cat'),
('I love Mittens the cat'),
('I love Mittens the cat a lot'),
('I love Mittens the cat so much'),
('Dogs are okay, I guess...')
SELECT TOP 10000
SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
COUNT(M.number) AS [Count]
FROM
@T as T
CROSS APPLY
(SELECT N.Number
FROM [master]..spt_values as N
WHERE N.type = 'P'
AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY
(SELECT N.Number
FROM [master]..spt_values as N
WHERE N.type = 'P'
AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE
N.number <= LEN(t.chvStrings) - M.number + 1
AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR N.number = 1)
GROUP BY
SUBSTRING(T.chvStrings, N.Number, M.Number)
ORDER BY
COUNT(T.chvStrings) DESC,
LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC