How to identify and redact all instances of a matching pattern in T-SQL

Question

I have a requirement to run a function over certain fields to identify and redact any numbers which are 5 digits or longer, ensuring all but the last 4 digits are replaced with *

For example: "Some text with 12345 and 1234 and 12345678" would become "Some text with *2345 and 1234 and ****5678"

I've used PATINDEX to identify the the starting character of the pattern:

PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', TEST_TEXT)

I can recursively call that to get the starting character of all the occurrences, but I'm struggling with the actual redaction.

Does anyone have any pointers on how this can be done? I know to use REPLACE to insert the *s where they need to be, it's just the identification of what I should actually be replacing I'm struggling with.

Could do it on a program, but I need it to be T-SQL (can be a function if needed).

Any tips greatly appreciated!

. . Although you could write a UDF to do this, SQL Server is not really the right tool for this problem. — Gordon Linoff, Oct 22 '19 at 12:33
@GordonLinoff Thanks, unfortunately this is part of a technically restricted delivery. This part is pulling data from a source, but I have to redact that data at the query level, and using SQL. So it has to be T-SQL, the client has mentioned I can make a function on the SQL server if I really need to, but other than that they've restricted me to T-SQl — Nyk, Oct 22 '19 at 12:36
Even in programming languages which support regex, your requirement is still not staightforward. You would need a regex replacement with a callback function most likely. — Tim Biegeleisen, Oct 22 '19 at 12:38

score 2 · Accepted Answer · answered Oct 22 '19 at 13:54

2

You can do this using the built in functions of SQL Server. All of which used in this example are present in SQL Server 2008 and higher.

DECLARE @String VARCHAR(500) = 'Example Input: 1234567890, 1234, 12345, 123456, 1234567, 123asd456'
DECLARE @StartPos INT = 1, @EndPos INT = 1;
DECLARE @Input VARCHAR(500) = ISNULL(@String, '') + ' '; --Sets input field and adds a control character at the end to make the loop easier.
DECLARE @OutputString VARCHAR(500) = ''; --Initalize an empty string to avoid string null errors

WHILE (@StartPOS <> 0)
BEGIN
    SET @StartPOS = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @Input);
    IF @StartPOS <> 0
    BEGIN
        SET @OutputString += SUBSTRING(@Input, 1, @StartPOS - 1); --Seperate all contents before the first occurance of our filter
        SET @Input = SUBSTRING(@Input, @StartPOS, 500); --Cut the entire string to the end. Last value must be greater than the original string length to simply cut it all.

        SET @EndPos = (PATINDEX('%[0-9][0-9][0-9][0-9][^0-9]%', @Input)); --First occurance of 4 numbers with a not number behind it.
        SET @Input = STUFF(@Input, 1, (@EndPos - 1), REPLICATE('*', (@EndPos - 1))); --@EndPos - 1 gives us the amount of chars we want to replace.
    END
END
SET @OutputString += @Input; --Append the last element

SET @OutputString = LEFT(@OutputString, LEN(@OutputString))
SELECT @OutputString;

Which outputs the following:

Example Input: ******7890, 1234, *2345, **3456, ***4567, 123asd456

This entire code could also be made as a function since it only requires an input text.

answered Oct 22 '19 at 13:54

D Kramer

728
6
16

This works also but I do everything I can to avoid loops like the plague. – Sean Lange Oct 22 '19 at 13:57
That looks pretty bang on - I need to head off at the minute but I'll get this all checked out tomorrow and update the question. Thanks very much for your input! – Nyk Oct 22 '19 at 13:58
@SeanLange I agree, but from a programming perspective i tried to keep it as native and 'relatively' clean as possible as well as easily understandable. And am not a fan of simply throwing in external functions/procedures into databases. Even though i have Jeff Moden's splitter on my 2008 Sql Server. – D Kramer Oct 22 '19 at 14:04
1

@DKramer I don't disagree and was not at all slamming your approach. But I think once we are going beyond selecting data and doing weird stuff like this then you have to start thinking out of the box. And this isn't an external function, the entire function is nothing but pure t-sql. I am far more concerned about performant queries than keeping everything "clean". Just my 2¢. Notice in your example here that to do this from a table would require another outer loop. – Sean Lange Oct 22 '19 at 14:06
1

@SeanLange You're right, and whilst i was testing out my code/example and adding in comments there were no answers yet. Your solution, in my eyes is also more suited for this problem. It just didn't springto mind for myself and OP mentioned he can make a function (only) if truly needed, so i tried to take as bare a bones approach as I could. – D Kramer Oct 22 '19 at 14:20
1

To be fair I through together a table with your sample data and the OPs sample. Then wrapped your logic in a cursor. I also tweaked mine a little bit to add a preliminary cte for the TRY_CONVERT. To increase row count I just inserted copies of the same data. At less than 50 rows there was no comparison. The loop was slower by orders of magnitude (but of course not noticeable). Once the table expanded to around 15-20k rows it got closer in performance. Around 100k rows your looping process actually started to be a little quicker. And at a million rows your nested loops were markedly quicker. – Sean Lange Oct 22 '19 at 15:14
This is absolutely perfect - exactly what I needed. I created it as function with some minor amendments to accommodate my data, namely changing the size of the VARCHARs to MAX as the source data has often enormous quantities of text far exceeding 500 characters and a quick check at the start to return NULL where the input is NULL (as the column I'm operating on can also be NULL). Thank you so much for help @DKramer, I was really stuck with this one, and to everyone else who provided tips and code! Marking this as the answer as it best fits my requirement. – Nyk Oct 23 '19 at 06:53

score 1 · Answer 2 · answered Oct 22 '19 at 13:13

1

A dirty solution with recursive CTE

DECLARE 
  @tags nvarchar(max) = N'Some text with 12345 and 1234 and 12345678',
  @c nchar(1) = N' ';
;
WITH Process (s, i)
as
(
SELECT @tags, PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @tags)
UNION ALL 
SELECT value,  PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', value)
FROM
(SELECT SUBSTRING(s,0,i)+'*'+SUBSTRING(s,i+4,len(s)) value
FROM Process
WHERE i >0) calc
  -- we surround the value and the string with leading/trailing ,
  -- so that cloth isn't a false positive for clothing
) 
SELECT * FROM Process
WHERE i=0

I think a better solution it's to add clr function in Ms SQL Server to manage regexp. sql-clr/RegEx

answered Oct 22 '19 at 13:13

Tohm

305
1
5

This isn't quite the results they are looking for. Really close though. – Sean Lange Oct 22 '19 at 13:29
Thanks Tohm, really helpful starting point. It needs to, however, replace each character of the redaction with a *, not the whole pattern (12345678 = ****5678, not *5678 as the code currently does). I'll play around with this and see what I can do. Thanks :-) – Nyk Oct 22 '19 at 13:32
just try with ```SELECT SUBSTRING(s,0,i)+'****'+SUBSTRING(s,i+4,len(s)) value ``` – Tohm Oct 22 '19 at 13:56

score 1 · Answer 3 · answered Oct 22 '19 at 13:34

Here is an option using the DelimitedSplit8K_LEAD which can be found here. https://www.sqlservercentral.com/articles/reaping-the-benefits-of-the-window-functions-in-t-sql-2 This is an extension of Jeff Moden's splitter that is even a little bit faster than the original. The big advantage this splitter has over most of the others is that it returns the ordinal position of each element. One caveat to this is that I am using a space to split on based on your sample data. If you had numbers crammed in the middle of other characters this will ignore them. That may be good or bad depending on you specific requirements.

declare @Something varchar(100) = 'Some text with 12345 and 1234 and 12345678';

with MyCTE as
(
    select x.ItemNumber 
        , Result = isnull(case when TRY_CONVERT(bigint, x.Item) is not null then isnull(replicate('*', len(convert(varchar(20), TRY_CONVERT(bigint, x.Item))) - 4), '') + right(convert(varchar(20), TRY_CONVERT(bigint, x.Item)), 4) end, x.Item)
    from dbo.DelimitedSplit8K_LEAD(@Something, ' ') x
)
select Output = stuff((select ' ' + Result 
                        from MyCTE 
                        order by ItemNumber
                        FOR XML PATH('')), 1, 1, '')

This produces: Some text with *2345 and 1234 and ****5678

That looks very promising, although to be honest I don't understand how it works from a cursory glance - I'll read over the link and the code in more detail and let you know if it address the issue. Thanks a lot :-) — Nyk, Oct 22 '19 at 13:50
The DelimitedSplit8K function is where the magic happens. The cte here just isolates if the element is a number or not. And if it is a number it swaps out the digits for * except for the last 4. — Sean Lange, Oct 22 '19 at 13:55

How to identify and redact all instances of a matching pattern in T-SQL

3 Answers3