There is also another approach:
with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select
str,
case
when regexp_like(str, '[^'||chr(1)||'-'||chr(127)||']')
then 'Ok'
else 'Not ok'
end as res,
xmlcast(
xmlquery(
'count(string-to-codepoints(.)[. > 127])'
passing t.str
returning content)
as int) cnt_over_127
from t;
Results:
STR RES CNT_OVER_127
---------- ------ ------------
why Ä ? Ok 1
why - ? Not ok 0
why - ? Ä Ok 1
why Not ok 0
As you can see I've used xmlquery()
with string-to-codepoints
xpath function, then filtered out codepoints >127 and returned their count()
.
Also you can use dump
or utl_raw.cast_to_raw()
functions, but it's a bit more complex and I'm a bit lazy to write full solutions using them.
But just small draft:
with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select
str,
case
when regexp_like(str, '[^'||chr(1)||'-'||chr(127)||']')
then 'Ok'
else 'Not ok'
end as res,
dump(str,1016) dmp,
dump(str,1015) dmp,
utl_raw.cast_to_raw(str) as_row,
regexp_count(dump(str,1016)||',', '[89a-f][0-9a-f],') xs
from t;
Results:
STR RES DMP DMP AS_ROW XS
---------- ------ ------------------------------------------------------------------- ----------------------------------------------------------------------- -------------------- --
why Ä ? Ok Typ=1 Len=8 CharacterSet=AL32UTF8: 77,68,79,20,c3,84,20,3f Typ=1 Len=8 CharacterSet=AL32UTF8: 119,104,121,32,195,132,32,63 77687920C384203F 2
why - ? Not ok Typ=1 Len=7 CharacterSet=AL32UTF8: 77,68,79,20,2d,20,3f Typ=1 Len=7 CharacterSet=AL32UTF8: 119,104,121,32,45,32,63 776879202D203F 0
why - ? Ä Ok Typ=1 Len=10 CharacterSet=AL32UTF8: 77,68,79,20,2d,20,3f,20,c3,84 Typ=1 Len=10 CharacterSet=AL32UTF8: 119,104,121,32,45,32,63,32,195,132 776879202D203F20C384 2
why Not ok Typ=1 Len=3 CharacterSet=AL32UTF8: 77,68,79 Typ=1 Len=3 CharacterSet=AL32UTF8: 119,104,121 776879 0
Note: as that is unicode, so the first byte >127 means that is a multibyte character, so it counts 'Ä' twice - c3,84,
- both bytes are higher than 127.