0

Using SQL Server 2016, I have a need to scrub white space a certain way and implement INITCAP.

The whitespace scrubber is simple. I'm having trouble getting the INITCAP replacement working properly.

The accepted answer to Initcap equivalent in mssql is wrong, as noted in the first comment.

My data contains values that have multiple spaces in a row and special characters, (&, %, etc.).

stuff(): In SQL Server 2016, string_split does not have an option to prove an ordinal value and does not guarantee that the results are returned in any specific order. So, I need to write code to ensure values are returned from split_string in the correct order.

convert(xml,...): Decodes most of the XML-encoded values.

convert(varchar(max),...): ...because XML can't be used when needing SELECT DISTINCT

SQL Fiddle

with T as (
  select *
  from (
  values ('Minesota Mining and   Manufacturing')
  , ('Minesota Mining & Manufacturing   ')
  , (' tillamook')
  , ('MUTUAL OF OMAHA')
  , ('   ')
  ) q(s)
),
scrubbed as (
  select T.s as InitialValue
  , CASE 
      WHEN LEN(RTRIM(T.s)) > 0
        THEN LTRIM(RTRIM(T.s))
    END as s
  from T
)
select distinct s.InitialValue
, stuff(
    (
      SELECT ' ' + t2.word
      from (
          select str.value
          , upper(substring(str.value, 1, 1)) + 
            case when len(str.value) > 1 then lower(substring(str.value, 2, len(str.value) - 1)) else '' end as word
          , charindex(' ' + str.value + ' ', ' ' + s.s + ' ') as idx
          from string_split(s.s, ' ') str
        ) t2
      order by t2.idx
      FOR XML PATH('')
    ), 
    1, 
    1, 
    ''
  ) as INITCAP_xml
, convert(
    varchar(max), 
    convert(
      xml, 
      stuff(
        (
          SELECT ' ' + t2.word
          from (
              select str.value
              , upper(substring(str.value, 1, 1)) + 
                case when len(str.value) > 1 then lower(substring(str.value, 2, len(str.value) - 1)) else '' end as word
              , charindex(' ' + str.value + ' ', ' ' + s.s + ' ') as idx
              from string_split(s.s, ' ') str
            ) t2
          order by t2.idx
          FOR XML PATH('')
        ), 
        1, 
        1, 
        ''
      )
    )
  ) as INITCAP_decoded
from scrubbed s

You see in the output that using FOR XML causes some of the characters to be encoded (like [space] = $#x20; and & = &). By converting to XML data type, some of those characters are decoded. But some characters (like &) remain encoded.

InitialValue INITCAP_attempt1 INITCAP_xml INITCAP_decoded
Minesota Mining and Manufacturing Minesota Mining And   Manufacturing Minesota Mining And Manufacturing Minesota Mining And Manufacturing
Minesota Mining & Manufacturing Minesota Mining & Manufacturing Minesota Mining & Manufacturing Minesota Mining & Manufacturing
 tillamook Tillamook Tillamook Tillamook
MUTUAL OF OMAHA Mutual Of Omaha Mutual Of Omaha Mutual Of Omaha
    null null null

REPLACE(s, '&', '&') doesn't seem like a reasonable option because I don't know what other values I'll run into over time. Is there a good, general way to handle characters that will be encoded by FOR XML?

Within a view (so, without using user defined functions or stored procedures), is there a better way to implement INITCAP in SQL Server?

John Cappelletti
  • 79,615
  • 7
  • 44
  • 66
dougp
  • 2,810
  • 1
  • 8
  • 31
  • 2
    Seems like trying so use screws as roofing nails and pounding them in with the handle of a screwdriver... – Jason Goemaat Feb 09 '23 at 19:49
  • That's just confusing. Are you saying I should just use the `INITCAP()` function that is built into SQL Server 2016? Mine doesn't seem to have that function. – dougp Feb 09 '23 at 20:02
  • I'm just thinking how it could really be a performance trap. There's no built-in function for it. If the original value isn't needed, why not fix it when it's inserted? Maybe your table is small enough that it doesn't matter much, but if you have millions of rows and someone tries to filter based on that row, it has to do a table scan and run the transformation for every row every time. – Jason Goemaat Feb 10 '23 at 15:42
  • It's for a view that gets used once per day to load data from one system to another system. I'm having to change where in the process the view is run -- moving it upstream. The upstream data has not been processed by the ETL, so I'm having to bake some of the ETL logic into the view to make the data exactly match what the downstream application is currently getting (and expects). An exact match is important. Plus, at a maximum of about 3700 rows per day, it should not be a performance problem. – dougp Feb 10 '23 at 19:38

3 Answers3

2

If interested in a SVF, here is a scaled down version which allows customization and edge events. For example rather than Phd, you would get PhD ... MacDonald, O'Neil

This is a dramatically scaled down version.. My rules/exceptions are in a generic mapping table.

Example

 select *
       ,[dbo].[svf-Str-Proper] (S)
  from (
  values ('Minesota Mining and   Manufacturing')
  , ('Minesota Mining & Manufacturing   ')
  , (' tillamook')
  , ('MUTUAL OF OMAHA')
  , ('   ')
  ) q(s)

Results

s                                       (No column name)
Minesota Mining and   Manufacturing     Minesota Mining And Manufacturing
Minesota Mining & Manufacturing         Minesota Mining & Manufacturing
 tillamook                              Tillamook
MUTUAL OF OMAHA                         Mutual Of Omaha
    

The Function if Iterested

CREATE FUNCTION [dbo].[svf-Str-Proper] (@S varchar(max))
Returns varchar(max)
As
Begin
    Set @S = ' '+ltrim(rtrim(replace(replace(replace(lower(@S),' ','†‡'),'‡†',''),'†‡',' ')))+' '
    ;with cte1 as (Select * From (Values(' '),('-'),('/'),('\'),('['),('{'),('('),('.'),(','),('&'),(' Mc'),(' Mac'),(' O''') ) A(P))
         ,cte2 as (Select * From (Values('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('I'),('J'),('K'),('L'),('M')
                                       ,('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
                                       ,('LLC'),('PhD'),('MD'),('DDS'),('II'),('III'),('IV')
                                 ) A(S))
         ,cte3 as (Select F = Lower(A.P+B.S),T = A.P+B.S From cte1 A Cross Join cte2 B 
                   Union All 
                   Select F = Lower(B.S+A.P),T = B.S+A.P From cte1 A Cross Join cte2 B where A.P in ('&') 
                  ) 
    Select @S = replace(@S,F,T) From cte3 
    Return rtrim(ltrim(@S))
End
-- Syntax : Select [dbo].[svf-Str-Proper]('john cappelletti')
--          Select [dbo].[svf-Str-Proper]('james e. o''neil')
--          Select [dbo].[svf-Str-Proper]('CAPPELLETTI II,john old macdonald iv phd,dds llc b&o railroad bank-one at&t BD&I Bank-Five dr. Langdon,dds')
John Cappelletti
  • 79,615
  • 7
  • 44
  • 66
  • That looks pretty slick. It doesn't use XML and I think I can convert it to use directly in a view (not as a function). I can see that ideally you'd want cte1 and cte2 to be maintained externally to quickly accommodate newly-discovered cases in your data. Will this work when using a case insensitive collation? I don't currently have a need for all of this functionality and I'm looking for a more generic way to handle special characters. But this will probably be a better overall solution for what my data integration team has been doing with these types of values. – dougp Feb 09 '23 at 20:43
  • What kind of magic is this? cte3 has 462 rows, so `SELECT REPLACE(@S, F, T) FROM cte3` returns 462 rows. So how does @S get only the single, correct answer? – dougp Feb 09 '23 at 21:33
  • @dougp The final Select is a cascading replace – John Cappelletti Feb 09 '23 at 22:15
1

Please try the following solution.

It is using SQL Server XML, XQuery, and its FLWOR expression.

Notable points:

  • cast as xs:token? is taking care of the whitespaces, i.e:
    • All invisible TAB, Carriage Return, and Line Feed characters will be replaced with spaces.
    • Then leading and trailing spaces are removed from the value.
    • Further, contiguous occurrences of more than one space will be replaced with a single space.
  • FLWOR expression is taking care of a proper case.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (tokens VARCHAR(MAX));
INSERT @tbl (tokens) VALUES
('mineSota Mining and   MaNufacturing'),
('Minesota Mining & Manufacturing   '),
(' tillamook'),
('MUTUAL  OF   OMAHA'),
('   ');
-- DDL and sample data population, end

DECLARE @separator CHAR(1) = SPACE(1);
SELECT t.*, scrubbed
    , result = c.query('
        for $x in /root/r/text()
        return concat(upper-case(substring($x,1,1)),lower-case(substring($x,2,1000)))
        ').value('text()[1]', 'VARCHAR(MAX)')
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<r><![CDATA[' + tokens + ' ' + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)')) AS t1(scrubbed)
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
      REPLACE(scrubbed, @separator, ']]></r><r><![CDATA[') + 
      ']]></r></root>' AS XML)) AS t2(c);

Output

tokens scrubbed result
mineSota Mining and MaNufacturing mineSota Mining and MaNufacturing Minesota Mining And Manufacturing
Minesota Mining & Manufacturing Minesota Mining & Manufacturing Minesota Mining & Manufacturing
tillamook tillamook Tillamook
MUTUAL OF OMAHA MUTUAL OF OMAHA Mutual Of Omaha
NULL
Yitzhak Khabinsky
  • 18,471
  • 2
  • 15
  • 21
  • Exactly what I need for my project today. thanks – dougp Feb 09 '23 at 20:43
  • ...almost exactly. It's important to keep the multiple spaces that are not on the ends of the string, like `Minesota Mining And Manufacturing`. I'll need to try to adjust for that. – dougp Feb 09 '23 at 21:08
0

You've made the classic SQL Server XML mistake, one cannot just use the PATH(''). You have to do the convuluted PATH(''), TYPE).value('.', 'NVARCHAR(MAX)') thing to get proper encoded characters.

Here's your fixed version:

with T as (
  select *
  from (
  values ('Minesota Mining and   Manufacturing')
  , ('Minesota Mining & Manufacturing   ')
  , (' tillamook')
  , ('MUTUAL OF OMAHA')
  , ('   ')
  ) q(s)
),
scrubbed as (
  select T.s as InitialValue
  , CASE 
      WHEN LEN(RTRIM(T.s)) > 0
        THEN LTRIM(RTRIM(T.s))
    END as s
  from T
)
select distinct s.InitialValue
, stuff(
    (
      SELECT ' ' + t2.word
      from (
          select str.value
          , upper(substring(str.value, 1, 1)) + 
            case when len(str.value) > 1 then lower(substring(str.value, 2, len(str.value) - 1)) else '' end as word
          , charindex(' ' + str.value + ' ', ' ' + s.s + ' ') as idx
          from string_split(s.s, ' ') str
        ) t2
      order by t2.idx
      FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)'), 
    1, 
    1, 
    ''
  ) as INITCAP
from scrubbed s
siggemannen
  • 3,884
  • 2
  • 6
  • 24
  • Now that is exactly what I needed. It accurately reproduces `INITCAP()`, preserving all of the spaces. – dougp Feb 09 '23 at 23:01