SQL script for removing extra characters

Question

I have an MSSQL 2012 database with some data issues in a text column, A.

There are many occurrences of an additional, unnecessary character after the </B> tag, for instance:

'<B>Something</B>g' where it should be '<B>Something</B>'
'<B>SomethingElse</B>e' where it should be '<B>SomethingElse</B>'

These values are part of a longer text and can occur more than once. Example column value:

'Some text is here <B>Something</B>g and also here <B>SomethingElse</B>e more text'

Those 'extra' characters are always the same as the last character between the <B> tags. I would like to create an SQL script which will:

1. Remove the extra character after the </B> tag.
2. Do so only if the extra character is the same as the last character between the <B> tags (as an additional check). EDIT: This is not absolutely necessary.

I assume there is a way of calling the replace function, as in this pseudocode, in which X represents any character:

replace(X</B>X, X</B>);

But I am not very good at SQL, and I don't know how to implement check 2.

Thank you for your help.

Possible duplicate; try: http://stackoverflow.com/questions/194652/sql-server-regular-expressions-in-t-sql — SnakeFoot, Mar 24 '16 at 12:15

trincot · Accepted Answer · 2016-03-24T15:18:18.823

If your column contains nothing other than those strings, you could use this update statement on column a:

update  mytable
set     a = left(a, len(a)-1)
where   left(right(a, 6), 5) = right(a, 1) + '</B>'

Here are some test cases in a fiddle.
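Before running the update, the same condition can be previewed with a plain SELECT. This is only a minimal sketch: the sample values are made up and supplied inline through VALUES, and the names v, original and cleaned are illustrative, not part of the answer.

```sql
-- Preview of the UPDATE's condition; no real table is read or modified.
SELECT v.a AS original,
       CASE WHEN left(right(v.a, 6), 5) = right(v.a, 1) + '</B>'
            THEN left(v.a, len(v.a) - 1)   -- drop the duplicated last character
            ELSE v.a                       -- leave everything else untouched
       END AS cleaned
FROM (VALUES ('<B>Something</B>g'),
             ('<B>SomethingElse</B>e'),
             ('<B>Unchanged</B>x')) AS v(a);
```

The first two values should lose their trailing character, while the third stays as it is because 'x' differs from 'd'. Note that the comparison follows the column's collation, so under a case-insensitive collation 'G' and 'g' count as the same character.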

To replace such occurrences in longer strings, where there might be several of them, you can use this recursive query:

WITH recursive AS (
    SELECT replace(a, '</B>', 'µ') as a
    FROM   mytable
    UNION ALL
    SELECT stuff(a, charindex('µ', a),
           CASE WHEN substring(a, charindex('µ', a)-1, 1)
                   = substring(a, charindex('µ', a)+1, 1)
                THEN 2 
                ELSE 1 
           END, '</B>')
    FROM   recursive
    WHERE  charindex('µ', a) > 0
)
SELECT * 
FROM   recursive
WHERE  charindex('µ', a) = 0

The character µ, which appears in several places, should be a character that you never expect to have in your data. Replace it with another character if necessary.

Here is a fiddle.
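Whether µ is safe to use as the placeholder can be checked up front by counting the rows that already contain it. The sketch below uses two made-up rows in a derived table as a stand-in for the real table; in practice the FROM clause would simply name mytable.

```sql
-- Count rows that already contain the intended placeholder character.
-- The derived table is a hypothetical stand-in for the real mytable.
SELECT count(*) AS rows_containing_placeholder
FROM   (VALUES ('plain text <B>x</B>x'),
               ('5 µm filter <B>y</B>y')) AS mytable(a)
WHERE  charindex('µ', a) > 0;
```

If the count is not zero, pick a different placeholder before running the recursive statements.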

The recursive query, turned into an update statement, looks like this. It assumes that your table has a primary key id:

WITH recursive AS (
    SELECT id, 
           replace(a, '</B>', 'µ') as a,
           0 as modified
    FROM   mytable
    UNION ALL
    SELECT id, 
           stuff(a, charindex('µ', a),
           CASE WHEN substring(a, charindex('µ', a)-1, 1)
                   = substring(a, charindex('µ', a)+1, 1)
              THEN 2 ELSE 1 END, '</B>'),
           1
    FROM   recursive
    WHERE  charindex('µ', a) > 0
)
UPDATE     mytable
SET        a = recursive.a
FROM       recursive
INNER JOIN mytable 
        ON mytable.id = recursive.id 
WHERE      charindex('µ', recursive.a) = 0 
AND        recursive.modified = 1;

Here is the fiddle for that as well.
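One caveat for the recursive statements: SQL Server stops a recursive CTE with an error after 100 recursion levels by default, and here each '</B>' in a row costs one level. If a single value can contain more than 100 closing tags, a MAXRECURSION hint at the very end of the statement raises the limit. The sketch below repeats the SELECT variant on one inline sample row; the src CTE is a hypothetical stand-in for mytable.

```sql
-- Self-contained sketch: the sample row comes from VALUES, not from a real table.
WITH src AS (
    SELECT v.a
    FROM (VALUES ('Some text is here <B>Something</B>g and also here <B>SomethingElse</B>e more text')) AS v(a)
),
recursive AS (
    SELECT replace(a, '</B>', 'µ') AS a
    FROM   src
    UNION ALL
    SELECT stuff(a, charindex('µ', a),
           CASE WHEN substring(a, charindex('µ', a)-1, 1)
                   = substring(a, charindex('µ', a)+1, 1)
                THEN 2 ELSE 1 END, '</B>')
    FROM   recursive
    WHERE  charindex('µ', a) > 0
)
SELECT a
FROM   recursive
WHERE  charindex('µ', a) = 0
OPTION (MAXRECURSION 0);  -- 0 removes the limit; a positive cap such as 1000 is a safer choice
```

The same OPTION clause can be appended after the last condition of the UPDATE statement.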

I just edited. The answer is fine, but in my case the column contains not only "valuev"; it can contain arbitrary text anywhere, e.g. "This is some text valuee and once more valuess and here" — Filip, Mar 24 '16 at 13:30
Ok, thanks. If it is very difficult, the check under 2) is not absolutely necessary — Filip, Mar 24 '16 at 14:24

score 1 · Answer 2 · answered Mar 24 '16 at 12:49

You can create a scalar function:

CREATE FUNCTION [dbo].[RemoveChars] 
(
    -- Input text; the NVARCHAR(50) parameter silently truncates longer values
    @InputStr NVARCHAR(50)
)
RETURNS NVARCHAR(50)
AS
BEGIN
    DECLARE @SearchStr NVARCHAR(4) = '</B>'
    DECLARE @LastChar CHAR(1)
    DECLARE @LastCharInStr CHAR(1)
    DECLARE @Result NVARCHAR(50)

    SET @LastChar = SUBSTRING(@InputStr, 
                    CHARINDEX(@SearchStr, @InputStr) + LEN(@SearchStr), 1)
    SET @LastCharInStr = SUBSTRING(@InputStr, 
                         CHARINDEX(@SearchStr, @InputStr) - 1, 1)

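    -- Strip only when the tag is present and the character after it repeats the one before it.
    -- Note: everything after the first '</B>' is discarded, so this suits values that end right after the tag.
    IF (CHARINDEX(@SearchStr, @InputStr) > 0 AND @LastCharInStr = @LastChar) 
        SET @Result = SUBSTRING(@InputStr, 0, 
                      CHARINDEX(@SearchStr, @InputStr) + LEN(@SearchStr))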
    ELSE
        SET @Result = @InputStr

    RETURN @Result
END

And then call it:

UPDATE MyTable
Set A = dbo.RemoveChars(A)

Personally, I would create a second function so that the update is applied only to the values that have a difference between the last char in the string and the char after the </B>, but that's for you to decide.
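The second function mentioned here is not shown in the answer. As one possible alternative (an assumption, not the author's code), the check can be written as a plain predicate that selects only the rows where the character after the first '</B>' repeats the character before it. The table variable and its three rows are made up for illustration.

```sql
-- Hypothetical sample data in a table variable; no real table is touched.
DECLARE @MyTable TABLE (A NVARCHAR(50));
INSERT INTO @MyTable (A)
VALUES (N'<B>Something</B>g'), (N'<B>Other</B>x'), (N'no tag here');

-- Rows that qualify for the clean-up: the tag is present, has a character
-- before it, and the character right after the tag matches that character.
SELECT A
FROM   @MyTable
WHERE  CHARINDEX(N'</B>', A) > 1
  AND  SUBSTRING(A, CHARINDEX(N'</B>', A) - 1, 1)
     = SUBSTRING(A, CHARINDEX(N'</B>', A) + 4, 1);   -- 4 = length of '</B>'
```

The same WHERE clause could be added to the UPDATE so that rows without a duplicated character are never written.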

I just spotted @trincot's answer, and I guess it is far simpler than what I provided. His answer already includes the conditional part, which saves you from having to create a second function. — Thierry, Mar 24 '16 at 12:52
