T-SQL PATINDEX Pattern of Seven Characters with at Least One Letter and One Number

Question

I want to identify a chunk of seven characters within a text of any length:

Starts with a letter
Includes at least one number (anywhere)
All letters are uppercase

How would I represent this type of pattern with PATINDEX()? PATINDEX('%[A-Z]%',text) fills the first requirement but, does not fulfill the other requirement. How would I make this variable so that the numbers and letters within a seven character space can be jumbled in any way (after the first character)?

I use this to print out the chunk: SUBSTRING(MESSAGE_SUBJECT,PATINDEX('%[A-Z]%',MESSAGE_SUBJECT),7)

It doesn't seem like this is possible without CLR. To make it even simpler, would it be possible to find a seven character grouping that starts with a letter and includes one number?

that's fine, I can add the collation on the end - my focus is the first two bullets — OverflowingTheGlass, Sep 06 '18 at 19:40
@scsimon so the string doesn't necessarily begin with the pattern. it's just any seven character chunk that begins with a letter and includes at least one number. your pattern looks like it just returns the first 7 characters of the text, regardless of number inclusion — OverflowingTheGlass, Sep 06 '18 at 19:45

S3S · Answer 1 · 2018-09-06T20:37:32.683

Per my comments above...

declare @table table (a varchar(64))
insert into @table
values
('aaaaaA123A')
,('123A')
,('A123a')
,('A123')
,('A123ADD')
,('A1DD23A')
,('aAAA1DD23A')
,('aAAAAAAA')
,('hello there AA11BB2')


select a, 1 
from @table
where 
patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS) > 0
and substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7) collate Latin1_General_CS_AS = upper(substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7))
and patindex('%[0-9]%',substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)) > 0

Or you can flag it with a CASE

select
    a
    ,MeetsPattern = case 
                        when patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS) > 0
                        and substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7) collate Latin1_General_CS_AS = upper(substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7))
                        and patindex('%[0-9]%',substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)) > 0
                        then 1
                        else 0
                    end
from @table

Or to extract it

select
    a
    ,substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)
from @table
where
patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS) > 0
and substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7) collate Latin1_General_CS_AS = upper(substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7))
and patindex('%[0-9]%',substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)) > 0

thank you - please check out my comment as well. the text can be any length, I am searching for a seven character chunk. — OverflowingTheGlass, Sep 06 '18 at 19:46
thank you - looks like that gets close - however, it doesn't guarantee that a number is included. so, for instance "hello there AA11BB2" returns "hello t" — OverflowingTheGlass, Sep 06 '18 at 20:02
good catch @CameronTaylor added one more line to check for that too — S3S, Sep 06 '18 at 20:08
thank you - that looks like it's checking the already returned text. for instance "hello there AA11BB2" returns "hello t" and then the new piece you added checks "hello t" for a number. instead, it would be good to have the number check work with the initial check so "AA11BB2" is returned in the first place. — OverflowingTheGlass, Sep 06 '18 at 20:17
I didn’t return anything other than it matches. If you want me to extractbthe text then that’s possible too — S3S, Sep 06 '18 at 20:20
that would be awesome - basically from "hello there AA11BB2", i want to extract the "AA11BB2". — OverflowingTheGlass, Sep 06 '18 at 20:22
See the edit @CameronTaylor and anything past that you should be able to manipulate. The only small catch is if there is seven characters together preceding the digit it would fail, but that's a limitation of patindex which coudl be hacked with `reverse()` but this would get nasty... — S3S, Sep 06 '18 at 20:37

score 0 · Answer 2 · edited Sep 06 '18 at 21:15

0

I don't believe that PATINDEX() is going to give you what you need. The PATINDEX() function returns the position of the first occurrence that matches your string. I think you would be happier using the LIKE() function.

edited Sep 06 '18 at 21:15

Robert Columbia

6,313
15
32
40

answered Sep 06 '18 at 19:47

The_Flin

15
1
6

thanks - I edited my question to include the substring function I am using to print out the value based on the patindex – OverflowingTheGlass Sep 06 '18 at 19:48
1

But @The_Flin do not have enough reputation to comment. Flag it then as *Not an answer*. – Pm Duda Sep 06 '18 at 21:01
This is an answer. The answer says to use the LIKE() function. If you feel that this is incorrect or not useful, downvote the answer. – Robert Columbia Sep 06 '18 at 21:15
No, what scsimon posted is an answer. This is a comment that, in no way resolves the OP's problem. A proper answer to a question like this should include a code snippet or example that the OP can use to solve their problem accompanied by an explanation. Furthermore, `LIKE` does not offer any advantage over `PATINDEX` in this scenario, both `WHERE PATINDEX(,) > 0` and `WHERE LIKE `will return the same rows. – Alan Burstein Sep 06 '18 at 23:09

score 0 · Answer 3 · answered Sep 06 '18 at 23:39

A CLR or Regular Expressions is not needed for something like this. Problems like this are exactly that NGrams8K was designed to solve. First for a crash course on NGrams8K.

This:

DECLARE @string VARCHAR(100) = 'ABC123XYZ'

SELECT ng.position, ng.token 
FROM   dbo.NGrams8k(@string, 7) AS ng;

Returns:

position  token
--------- -----------
1         ABC123X
2         BC123XY
3         C123XYZ

To identify a chunk of letters (AKA a substring or, in the context of N-Grams, a 7-gram) that (1) begin with a letter, includes at least one number and contains no lower case letters you could use NGrams8K like so:

DECLARE @string VARCHAR(100) = 'x96AE0E33CFD5';

SELECT       ng.position, ng.token
FROM         dbo.ngrams8k(@string,7)                       AS ng
CROSS APPLY (VALUES(ng.token COLLATE latin1_general_bin2)) AS token(cs)
WHERE        token.cs LIKE '[A-Z]%[0-9]%' 
AND          token.cs NOT LIKE '%[a-z]%';

which returns:

position   token
---------- ---------------
4          AE0E33C
5          E0E33CF
7          E33CFD5

As you can see, we've extracted every 7-character substring that matches your requirements. Alternatively, this will be a wee-bit more efficient:

SELECT ng.position, ng.token
FROM   dbo.ngrams8k(@string,7) AS ng
WHERE (ASCII(LEFT(ng.token,1)) - 65) & 0x7FFF < 26
AND    PATINDEX('%[a-z]%',ng.token COLLATE latin1_general_bin2) = 0;

To better understand what's going on consider this query:

DECLARE @string VARCHAR(100) = 'x96AE0E33CFD5';

SELECT       ng.position, 
             ng.token, 
             isMatch = CASE WHEN token.cs LIKE '[A-Z]%[0-9]%' 
                             AND token.cs NOT LIKE '%[a-z]%' THEN 1 ELSE 0 END
FROM         dbo.ngrams8k(@string,7)                       AS ng
CROSS APPLY (VALUES(ng.token COLLATE latin1_general_bin2)) AS token(cs);

which returns:

position   token      isMatch
---------- ---------- ---------
1          x96AE0E    0
2          96AE0E3    0
3          6AE0E33    0
4          AE0E33C    1
5          E0E33CF    1
6          0E33CFD    0
7          E33CFD5    1

Here's an example against a table where you only want to return rows that match your criteria:

DECLARE @table TABLE (someId INT IDENTITY, string VARCHAR(100));
INSERT @table(string) VALUES ('!!!!AB1234567'),('c555'),('!!ABC1234ggg')

SELECT t.someId, t.string
FROM   @table AS t
WHERE EXISTS
(
  SELECT  1
  FROM    dbo.ngrams8k(t.string,7) AS ng
  WHERE  (ASCII(LEFT(ng.token,1)) - 65) & 0x7FFF < 26
  AND     PATINDEX('%[a-z]%',ng.token COLLATE latin1_general_bin2) = 0
);

this is interesting - any way to only pick up 7 character chunks that have a space on either side - i.e. word-level ngrams? this method is currently picking up (from "HELLO THERE 12345") "e 12345". I would like to ignore those situations and only pick up situations where the word itself fits the criteria. So from "hello there AA11BB2", I want "AA11BB2". — OverflowingTheGlass, Sep 07 '18 at 13:25
i see there is supposed to be a part 5 of the blog series that discusses word-level, but i can't seem to find it — OverflowingTheGlass, Sep 07 '18 at 13:37

T-SQL PATINDEX Pattern of Seven Characters with at Least One Letter and One Number

3 Answers3