SQL Server: How to find position of the first uppercase non-ascii character in the string?

Question

Generally speaking I need to search varchar(100) column for the pattern '%bA%', where

A - uppercase non-ascii character and
b - lowercase non-ascii character.

From the high level perspective I need to find all strings where [space] symbol is missed before the uppercase character, for example as a result of firstname and lastname columns concatenation without a space between them.

SQLFiddle environment for reproducing

-- WORKING (ASCII) - thanx to @Serpiton
create table #tmp (value varchar(100));
insert into #tmp (value) values ('JohnnyBravo'); -- expected output
insert into #tmp (value) values ('Johnny Bravo');
insert into #tmp (value) values ('Johnnybravo');
insert into #tmp (value) values ('johnnybravo');
select * from #tmp WHERE value collate Latin1_General_CS_AI like '%[a-z][^ abcdefghijklmnopqrstuvwxyz]%'
drop table #tmp;

-- NOT WORKING (NON-ASCII, Cyrillic, win-1251)
create table #tmp (value varchar(100));
insert into #tmp (value) values ('АндрейМорозов'); -- expected output
insert into #tmp (value) values ('Андрей Морозов');
insert into #tmp (value) values ('Андрейморозов');
insert into #tmp (value) values ('андрейморозов');
select * from #tmp WHERE value collate Cyrillic_General_CS_AI like '%[а-я][^ абвгдежзиклмнопрстуфхцчшщъыьэюя]%'
drop table #tmp;

-- UNICODE - NOT WORKING TOO
create table #tmp (value nvarchar(100) collate Cyrillic_General_CS_AI);
insert into #tmp (value) values (N'АндрейМорозов'); -- expected output
insert into #tmp (value) values (N'Андрей Морозов');
insert into #tmp (value) values (N'Андрейморозов');
insert into #tmp (value) values (N'андрейморозов');
select * from #tmp WHERE value like '%[а-я][^ абвгдежзиклмнопрстуфхцчшщъыьэюя]%'
drop table #tmp;

If you want to work with unicode data you have to tell SQL Server that's what you are doing. Your test data should be `create table #tmp (value Nvarchar(100) collate Cyrillic_General_CS_AI); insert into #tmp (value) values (N'АндрейМорозов');` Note the N's. — Michael Green, May 07 '14 at 10:35
@MichaelGreen thanx for the tip, I'm was trying that, but got the same (no) results — Andrey Morozov, May 07 '14 at 10:52
Let me know if the revised solution works for you. You will have to pad it out with all the Uppder case characters you need to deal with. — Michael Green, May 07 '14 at 10:57

score 2 · Accepted Answer · edited May 23 '17 at 12:32

If it really is just "firstname and lastname columns concatenation without a space between" then this may suffice:

select
    *
from mytable
where LEN(ConcatenatedName) <> LEN(REPLACE(ConcatenatedName, ' ', ''));

You can substitute whatever passes for a space in your local environment, of course. If there are, say, middle name which should be space-delimited this will pick them up too. However, you may get false positives if there are intended spaces e.g. two part surnames.

Second go

Fair point. I hadn't accounted for case. Here's some magic code which works with your test data:

with digits as
(
    SELECT
        *
    FROM
        (
        VALUES (0),(1), (2), (3), (4), (5), (6), (7), (8), (9)
        ) AS MyTable(i)
)
, Number as
(
    select (a.i * 10) + b.i as number
    from digits as a
    cross join digits as b
)
, LetterCase as
(
    select
        n.number
        ,t.value
        ,SUBSTRING(t.value, n.number, 1) as Letter
        ,ASCII(SUBSTRING(t.value, n.number, 1)) LetterASCII
        ,CASE
            when ASCII(SUBSTRING(t.value, n.number, 1)) between 65 and 90
                then 'True'
            else 'False'
        end as IsUpper
    from Number as n
    cross join #tmp as t
    where n.number between 1 and LEN(t.value)
)
select
    lc.value
from LetterCase as lc
where lc.IsUpper = 'True'
and lc.number > 1
and SUBSTRING(lc.value, lc.number - 1, 1) <> ' '

drop table #tmp;

It draws on my answer to this other question - Split words with a capital letter in sql.

Third go

Heres some magic code which works with your revised test data. You have to take care if your instance's (or database's or columns's) default collation is not the same as the one with which you want to work.

;with digits as
(
    SELECT
        *
    FROM
        (
        VALUES (0),(1), (2), (3), (4), (5), (6), (7), (8), (9)
        ) AS MyTable(i)
)
, Number as
(
    select (a.i * 10) + b.i as number
    from digits as a
    cross join digits as b
)
, UpperCaseCharacters as
(
    select NCHAR(1040) collate Cyrillic_General_CS_AI as CodePoint --А
    UNION ALL
    select NCHAR(1052) --М
    -- Extend this list with all the upper case character in your chosen glyph list.
)
, LetterCase as
(
    select
        n.number
        ,t.value
        ,CASE
            when SUBSTRING(t.value, n.number, 1) IN (select Codepoint from UpperCaseCharacters)
                then 'True'
            else ''
        end as IsUpper
    from Number as n
    cross join #tmp as t
    where n.number between 1 and LEN(t.value)
)
select
    lc.value
from LetterCase as lc
where lc.IsUpper = 'True'
and lc.number > 1
and SUBSTRING(lc.value, lc.number - 1, 1) <> ' ';

No it does not working - try to place your `where` statement inside code snippet from my post. — Andrey Morozov, May 05 '14 at 12:06
No it will not work in my case because I'm working with **non-ascii** characters. For example `ascii` function always return 63 for any `Cyrillic (win-1251)` character - check `select ascii('п')` — Andrey Morozov, May 06 '14 at 06:27
@Andrey - ASCII 63 is the query sign "?" - that's SQL Server's way of saying the given charachter cannot be represented in your current collation / code set. — Michael Green, May 07 '14 at 10:46
(+1) it seems that it works... going to try... will be back soon — Andrey Morozov, May 07 '14 at 11:00

Serpiton · Answer 2 · 2014-05-05T16:02:03.573

1

You can use the basic pattern recognition of the LIKE operator.
The most used capability is the %, but it recognize set search with [set].
The pattern you search is

WHERE column LIKE '%[^ ][A-Z]%'

Where the character ^ is used to negate the following pattern.

EDIT
Checking the comment from the OP I found out that SQLFiddle use a case insensite collation, with that the check will never work, still I had to change the logic to "a lower case letter followed by something that is not a lower case letter or a space", but to do this the second range have to be expanded to add the space. For english letter is

WHERE column LIKE '%[a-z][^ abcdefghijklmnopqrstuvwxyz]%'

SQLFiddle demo

edited May 05 '14 at 16:02

answered May 05 '14 at 10:35

Serpiton

3,676
3
24
35

In my case I working with NON-ASCII characters (codepage Win-1251) and A-Z pattern is not applicable for me. All characters from my local codepage is suitable form [^A-Z] – Andrey Morozov May 05 '14 at 10:40
any non-ascii character has code higher then `Z`, check `select ascii('а')` – Andrey Morozov May 05 '14 at 10:56
Ok, thanx, (+1) now it working with ascii charatres - but still no results with non-ascii - but that's progress :) – Andrey Morozov May 06 '14 at 06:31

score 0 · Answer 3 · answered May 05 '14 at 10:34

0

select * from <Table_name> where <Field_name> COLLATE Latin1_General_CS_AS like '%bA%'

answered May 05 '14 at 10:34

Anto Raja Prakash

1,328
8
12

Andrey Morozov · Answer 4 · 2014-05-07T12:17:36.670

The final solution in my case is TVF

alter function FindUpperCase(@value nvarchar(100) )
returns table
as return 
(
    with tmp as
    (
        select * from (values (@value)) as t(value)
    )
    , digits as
    (
        SELECT * FROM
            (
            VALUES (0),(1), (2), (3), (4), (5), (6), (7), (8), (9)
            ) AS MyTable(i)
    )
    , Number as
    (
        select (a.i * 10) + b.i as number
        from digits as a
        cross join digits as b
    )
    , LetterCase as
    (
        select
            n.number
            ,t.value
            ,CASE
                when UNICODE(SUBSTRING(t.value, n.number, 1)) BETWEEN UNICODE(N'А') AND UNICODE(N'Я')
                    then 'True'
                else ''
            end as IsUpper
        from Number as n
        cross join tmp as t
        where n.number between 1 and LEN(t.value)
    )
    select
        lc.value
    from LetterCase as lc
    where lc.IsUpper = 'True'
    and lc.number > 1
    and UNICODE(SUBSTRING(lc.value, lc.number - 1, 1)) BETWEEN UNICODE(N'а') AND UNICODE(N'я')
)

Usage

select c.fullname
from dbo.contacts c
cross apply dbo.FindUpperCase(c.fullname) as f

SQL Server: How to find position of the first uppercase non-ascii character in the string?

4 Answers4

Usage