2

I want to identify a chunk of seven characters within a text of any length:

  • Starts with a letter
  • Includes at least one number (anywhere)
  • All letters are uppercase

How would I represent this type of pattern with PATINDEX()? PATINDEX('%[A-Z]%',text) fills the first requirement but, does not fulfill the other requirement. How would I make this variable so that the numbers and letters within a seven character space can be jumbled in any way (after the first character)?

I use this to print out the chunk: SUBSTRING(MESSAGE_SUBJECT,PATINDEX('%[A-Z]%',MESSAGE_SUBJECT),7)

It doesn't seem like this is possible without CLR. To make it even simpler, would it be possible to find a seven character grouping that starts with a letter and includes one number?

OverflowingTheGlass
  • 2,324
  • 1
  • 27
  • 75

3 Answers3

2

Per my comments above...

declare @table table (a varchar(64))
insert into @table
values
('aaaaaA123A')
,('123A')
,('A123a')
,('A123')
,('A123ADD')
,('A1DD23A')
,('aAAA1DD23A')
,('aAAAAAAA')
,('hello there AA11BB2')


select a, 1 
from @table
where 
patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS) > 0
and substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7) collate Latin1_General_CS_AS = upper(substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7))
and patindex('%[0-9]%',substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)) > 0

Or you can flag it with a CASE

select
    a
    ,MeetsPattern = case 
                        when patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS) > 0
                        and substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7) collate Latin1_General_CS_AS = upper(substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7))
                        and patindex('%[0-9]%',substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)) > 0
                        then 1
                        else 0
                    end
from @table

Or to extract it

select
    a
    ,substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)
from @table
where
patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS) > 0
and substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7) collate Latin1_General_CS_AS = upper(substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7))
and patindex('%[0-9]%',substring(a,patindex('%[A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z][0-9A-Z]%',a collate Latin1_General_CS_AS),7)) > 0
S3S
  • 24,809
  • 5
  • 26
  • 45
  • thank you - please check out my comment as well. the text can be any length, I am searching for a seven character chunk. – OverflowingTheGlass Sep 06 '18 at 19:46
  • 1
    thank you - looks like that gets close - however, it doesn't guarantee that a number is included. so, for instance "hello there AA11BB2" returns "hello t" – OverflowingTheGlass Sep 06 '18 at 20:02
  • good catch @CameronTaylor added one more line to check for that too – S3S Sep 06 '18 at 20:08
  • thank you - that looks like it's checking the already returned text. for instance "hello there AA11BB2" returns "hello t" and then the new piece you added checks "hello t" for a number. instead, it would be good to have the number check work with the initial check so "AA11BB2" is returned in the first place. – OverflowingTheGlass Sep 06 '18 at 20:17
  • I didn’t return anything other than it matches. If you want me to extractbthe text then that’s possible too – S3S Sep 06 '18 at 20:20
  • that would be awesome - basically from "hello there AA11BB2", i want to extract the "AA11BB2". – OverflowingTheGlass Sep 06 '18 at 20:22
  • See the edit @CameronTaylor and anything past that you should be able to manipulate. The only small catch is if there is seven characters together preceding the digit it would fail, but that's a limitation of patindex which coudl be hacked with `reverse()` but this would get nasty... – S3S Sep 06 '18 at 20:37
0

I don't believe that PATINDEX() is going to give you what you need. The PATINDEX() function returns the position of the first occurrence that matches your string. I think you would be happier using the LIKE() function.

Robert Columbia
  • 6,313
  • 15
  • 32
  • 40
The_Flin
  • 15
  • 1
  • 6
  • thanks - I edited my question to include the substring function I am using to print out the value based on the patindex – OverflowingTheGlass Sep 06 '18 at 19:48
  • 1
    But @The_Flin do not have enough reputation to comment. Flag it then as *Not an answer*. – Pm Duda Sep 06 '18 at 21:01
  • This is an answer. The answer says to use the LIKE() function. If you feel that this is incorrect or not useful, downvote the answer. – Robert Columbia Sep 06 '18 at 21:15
  • No, what scsimon posted is an answer. This is a comment that, in no way resolves the OP's problem. A proper answer to a question like this should include a code snippet or example that the OP can use to solve their problem accompanied by an explanation. Furthermore, `LIKE` does not offer any advantage over `PATINDEX` in this scenario, both `WHERE PATINDEX(,) > 0` and `WHERE LIKE `will return the same rows. – Alan Burstein Sep 06 '18 at 23:09
0

A CLR or Regular Expressions is not needed for something like this. Problems like this are exactly that NGrams8K was designed to solve. First for a crash course on NGrams8K.

This:

DECLARE @string VARCHAR(100) = 'ABC123XYZ'

SELECT ng.position, ng.token 
FROM   dbo.NGrams8k(@string, 7) AS ng;

Returns:

position  token
--------- -----------
1         ABC123X
2         BC123XY
3         C123XYZ

To identify a chunk of letters (AKA a substring or, in the context of N-Grams, a 7-gram) that (1) begin with a letter, includes at least one number and contains no lower case letters you could use NGrams8K like so:

DECLARE @string VARCHAR(100) = 'x96AE0E33CFD5';

SELECT       ng.position, ng.token
FROM         dbo.ngrams8k(@string,7)                       AS ng
CROSS APPLY (VALUES(ng.token COLLATE latin1_general_bin2)) AS token(cs)
WHERE        token.cs LIKE '[A-Z]%[0-9]%' 
AND          token.cs NOT LIKE '%[a-z]%'; 

which returns:

position   token
---------- ---------------
4          AE0E33C
5          E0E33CF
7          E33CFD5

As you can see, we've extracted every 7-character substring that matches your requirements. Alternatively, this will be a wee-bit more efficient:

SELECT ng.position, ng.token
FROM   dbo.ngrams8k(@string,7) AS ng
WHERE (ASCII(LEFT(ng.token,1)) - 65) & 0x7FFF < 26
AND    PATINDEX('%[a-z]%',ng.token COLLATE latin1_general_bin2) = 0;

To better understand what's going on consider this query:

DECLARE @string VARCHAR(100) = 'x96AE0E33CFD5';

SELECT       ng.position, 
             ng.token, 
             isMatch = CASE WHEN token.cs LIKE '[A-Z]%[0-9]%' 
                             AND token.cs NOT LIKE '%[a-z]%' THEN 1 ELSE 0 END
FROM         dbo.ngrams8k(@string,7)                       AS ng
CROSS APPLY (VALUES(ng.token COLLATE latin1_general_bin2)) AS token(cs);

which returns:

position   token      isMatch
---------- ---------- ---------
1          x96AE0E    0
2          96AE0E3    0
3          6AE0E33    0
4          AE0E33C    1
5          E0E33CF    1
6          0E33CFD    0
7          E33CFD5    1

Here's an example against a table where you only want to return rows that match your criteria:

DECLARE @table TABLE (someId INT IDENTITY, string VARCHAR(100));
INSERT @table(string) VALUES ('!!!!AB1234567'),('c555'),('!!ABC1234ggg')

SELECT t.someId, t.string
FROM   @table AS t
WHERE EXISTS
(
  SELECT  1
  FROM    dbo.ngrams8k(t.string,7) AS ng
  WHERE  (ASCII(LEFT(ng.token,1)) - 65) & 0x7FFF < 26
  AND     PATINDEX('%[a-z]%',ng.token COLLATE latin1_general_bin2) = 0
);
Alan Burstein
  • 7,770
  • 1
  • 15
  • 18
  • this is interesting - any way to only pick up 7 character chunks that have a space on either side - i.e. word-level ngrams? this method is currently picking up (from "HELLO THERE 12345") "e 12345". I would like to ignore those situations and only pick up situations where the word itself fits the criteria. So from "hello there AA11BB2", I want "AA11BB2". – OverflowingTheGlass Sep 07 '18 at 13:25
  • i see there is supposed to be a part 5 of the blog series that discusses word-level, but i can't seem to find it – OverflowingTheGlass Sep 07 '18 at 13:37