4

I need to find invalid social security numbers in a varchar field in a SQL Server 2008 database table. (Valid SSNs are being defined by being in the format ###-##-#### - doesn't matter what the numbers are, as long as they are in that "3-digit dash 2-digit dash 4-digit" pattern.

I do have a working regex:

SELECT * 
FROM mytable
WHERE ssn NOT LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'

That does find the invalid SSNs in the column, but I know (okay - I'm pretty sure) that there is a way to shorten that to indicate that the previous pattern can have x iterations.

I thought this would work:

'[0-9]{3}-[0-9]{2}-[0-9]{4}'

But it doesn't.

Is there a shorter regex than the one above in the select, or not? Or perhaps there is, but T-SQL/SQL Server 2008 doesn't support it!?

Michael
  • 41,989
  • 11
  • 82
  • 128
marky
  • 4,878
  • 17
  • 59
  • 103
  • 1
    TSQL doesn't have native Regex support. It supports `LIKE` which you're using but that's a very simple language - not really regex at all. A better plan might be to store just 9 digits (quite simple to create constraints for that) and just insert the dashes when *displaying* the SSN - unless you actually need to support storing invalid data. – Damien_The_Unbeliever Aug 03 '15 at 13:09

2 Answers2

2

If you plan to get a shorter variant of your LIKE expression, then the answer is no.

In T-SQL, you can only use the following wildcards in the pattern:

%
- Any string of zero or more characters. WHERE title LIKE '%computer%' finds all book titles with the word computer anywhere in the book title.

_ (underscore)
Any single character. WHERE au_fname LIKE '_ean' finds all four-letter first names that end with ean (Dean, Sean, and so on).
[ ]
Any single character within the specified range ([a-f]) or set ([abcdef]). WHERE au_lname LIKE '[C-P]arsen' finds author last names ending with arsen and starting with any single character between C and P, for example Carsen, Larsen, Karsen, and so on. In range searches, the characters included in the range may vary depending on the sorting rules of the collation.
[^]
Any single character not within the specified range ([^a-f]) or set ([^abcdef]).

So, your LIKE statement is already the shortest possible expression. No limiting quantifiers can be used (those like {min,max}), not shorthand classes like \d.

If you were using MySQL, you could use a richer set of regex utilities, but it is not the case.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
0

I suggest you to use another solution like this:

-- Use `REPLICATE` if you really want to use a number to repeat
Declare @rgx nvarchar(max) = REPLICATE('#', 3) + '-' +
                             REPLICATE('#', 2) + '-' +
                             REPLICATE('#', 4);

-- or use your simple format string
Declare @rgx nvarchar(max) = '###-##-####';

-- then use this to get your final `LIKE` string.
Set @rgx = REPLACE(@rgx, '#', '[0-9]');

And you can also use something like '_' for characters then replace it with [A-Z] and so on.

shA.t
  • 16,580
  • 5
  • 54
  • 111