0

I want to find ascii strings in oracle query which have symbols more than chr(127)

I see a lot of suggestions that '['||chr(128)||'-'||chr(255)||']' must work, but it doesn't

so next must return OK, but it doesn't

select 'OK' as result from dual where regexp_like('why Ä ?', '['||chr(128)||'-'||chr(255)||']')

and next must not return OK, but it does

select 'OK' as result from dual where regexp_like('why - ?', '['||chr(128)||'-'||chr(255)||']')

UPD: Sorry, capital A umlaut in my case is \xC4 (ISO 8859 Latin 1) , but here it turns into unicode chr(50052)

Dmitry Perfilyev
  • 536
  • 1
  • 4
  • 14
  • You are assuming that "more than 127" means "between 128 and 255". That assumption is just wrong. There are many characters (the vast majority, actually) with code point greater than 255. For example, capital A umlaut (in your first string) is `chr(50052)`. In your second string, which character do you think is **not** below `chr(127)`? –  Aug 07 '21 at 20:54
  • Regarding character range, this one might be interesting for you: https://stackoverflow.com/questions/50914930/find-out-if-a-string-contains-only-ascii-characters – Wernfried Domscheit Aug 07 '21 at 21:27
  • What is the value of `NLS_CHARACTERSET`? For multibyte charset there will be code points above 255. But you may consider `TRANSLATE` function to remove all the characters before 127 and check the length of the remaining string. Or use `not regexp_like(` – astentx Aug 07 '21 at 21:43
  • 1
    What exactly are you trying to achieve? Why do you use codepoints? – Sayan Malakshinov Aug 08 '21 at 00:25

3 Answers3

2

How about a different approach? Split string into characters and check whether maximum value is higher than 127.

For example:

SQL> with test (col) as
  2    (select 'why Ä ?' from dual)
  3  select substr(col, level, 1) one_character,
  4         ascii(substr(col, level, 1)) ascii_of_one_character
  5  from test
  6  connect by level <= length(col);

ONE_ ASCII_OF_ONE_CHARACTER
---- ----------------------
w                       119
h                       104
y                       121
                         32
Ä                     50621         --> here it is!
                         32
?                        63

7 rows selected.

SQL>

Now, move it into a subquery and fetch the result:

SQL> with test (col) as
  2    (select 'why Ä ?' from dual)
  3  select case when max(ascii_of_one_character) > 127 then 'OK'
  4              else 'Not OK'
  5         end result
  6  from (select substr(col, level, 1) one_character,
  7          ascii(substr(col, level, 1)) ascii_of_one_character
  8        from test
  9        connect by level <= length(col)
 10       );

RESULT
------
OK

Or:

SQL> with test (col) as
  2    (select 'why - ?' from dual)
  3  select case when max(ascii_of_one_character) > 127 then 'OK'
  4              else 'Not OK'
  5         end result
  6  from (select substr(col, level, 1) one_character,
  7          ascii(substr(col, level, 1)) ascii_of_one_character
  8        from test
  9        connect by level <= length(col)
 10       );

RESULT
------
Not OK

Millions of rows? Well, even for two rows queries I posted wouldn't work properly. Switch to

SQL> with test (col) as
  2    (select 'why - ?' from dual union all
  3     select 'why Ä ?' from dual
  4    )
  5  select col,
  6         case when max(ascii_of_one_character) > 127 then 'OK'
  7              else 'Not OK'
  8         end result
  9  from (select col,
 10               substr(col, column_value, 1) one_character,
 11               ascii(substr(col, column_value, 1)) ascii_of_one_character
 12        from test cross join table(cast(multiset(select level from dual
 13                                                 connect by level <= length(col)
 14                                                ) as sys.odcinumberlist))
 15       )
 16  group by col;

COL      RESULT
-------- ------
why - ?  Not OK
why Ä ?  OK

SQL>

How will it behave? I don't know, try it and tell us. Note that for large data sets regular expressions might actually be slower than a simple substr option.


Yet another option: how about TRANSLATE? You don't have to split anything in that case. For example:

SQL> with test (col) as
  2    (select 'why - ?' from dual union all
  3     select 'why Ä ?' from dual
  4    )
  5  select col,
  6         case when nvl(length(res), 0) > 0 then 'OK'
  7              else 'Not OK'
  8         end result
  9  from (select col,
 10        translate
 11        (col,
 12         '!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ',
 13         '!') res
 14        from test
 15       );

COL      RESULT
-------- ------
why - ?  Not OK
why Ä ?  OK

SQL>
Littlefoot
  • 131,892
  • 15
  • 35
  • 57
  • Interesting approach, I'll try it, but I have millions of rows with rather long lines , so I'm afraid splitting every row to hundreds rows kill my oracle and DBA kill me afterwards – Dmitry Perfilyev Aug 07 '21 at 21:15
  • Right; well, you never said anything about number of rows. I edited the answer as query should be modified in such a case. Have a look, please. Oh, yes - which flowers do you like? – Littlefoot Aug 07 '21 at 21:21
  • Yet another option, @Dmitry - TRANSLATE. Check edited answer. – Littlefoot Aug 07 '21 at 21:33
  • Don't understand about flowers, but your answers seems to be work in my case. Sorry, I'll check them more carefully on Monday – Dmitry Perfilyev Aug 07 '21 at 21:56
  • When DBA kills you, which flowers should I bring to the funeral? Oh, killing is not an option any more? Well, no flowers for you, then, sorry. – Littlefoot Aug 07 '21 at 22:10
  • The `translate` approach is almost correct, but it will give the wrong answer if the input contains exclamation points, since those are not removed from the string. Use a character outside the desired range instead (for example a-umlaut). Also the test can be simpler - if the result of the `translate` operation is null or not null (no need to look at the length). –  Aug 08 '21 at 05:52
  • `TRANSLATE` version is not suitable for me, as I need to ignore ALL chars < 127 . Some of the strings has chr(17) or chr(7). But I need to find only those strings which have at least one char > 127. Any other chars - 0,1,2,3,....,126,127 - must be ignored. – Dmitry Perfilyev Aug 08 '21 at 10:58
0

Don't know why you want to use codepoints instead of character sets, but you can invert the logic - use not 1-127 - [^1-127] : DBFiddle

select 'OK' as result 
from dual 
where regexp_like('why Ä ?', '[^'||chr(1)||'-'||chr(127)||']');
select regexp_substr('why Ä ?', '[^'||chr(1)||'-'||chr(127)||']') x from dual;

And do not forget that some characters can be special characters like ] or even non-printable

FractalLotus
  • 350
  • 2
  • 11
Sayan Malakshinov
  • 8,492
  • 1
  • 19
  • 27
  • are you check your answer? replace "'why Ä ?'" to simple "why" and you see OK again. And that is my original question. – Dmitry Perfilyev Aug 08 '21 at 10:39
  • @DmitryPerfilyev are you sure? https://dbfiddle.uk/?rdbms=oracle_18&fiddle=2707add143c814e5191bd7203f2b22c3 – Sayan Malakshinov Aug 08 '21 at 12:41
  • yes, I'm sure , but I do this queries in SQLDeveloper Version 4.1.3.20 My results: 1 query: OK 2 query: OK 3 query: w – Dmitry Perfilyev Aug 08 '21 at 21:03
  • @DmitryPerfilyev configure your nls settings properly and show them, also retry this example in properly configured sql*plus and show what does return this query: `select dump('why Ä',1016) dmp from dual` – Sayan Malakshinov Aug 09 '21 at 00:34
0

There is also another approach:

with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select 
  str,
  case 
     when regexp_like(str, '[^'||chr(1)||'-'||chr(127)||']') 
       then 'Ok' 
       else 'Not ok' 
  end as res,
  xmlcast(
    xmlquery(
       'count(string-to-codepoints(.)[. > 127])' 
       passing t.str 
       returning content)
    as int) cnt_over_127
from t;

Results:

STR        RES    CNT_OVER_127
---------- ------ ------------
why Ä ?    Ok                1
why - ?    Not ok            0
why - ? Ä  Ok                1
why        Not ok            0

As you can see I've used xmlquery() with string-to-codepoints xpath function, then filtered out codepoints >127 and returned their count().

Also you can use dump or utl_raw.cast_to_raw() functions, but it's a bit more complex and I'm a bit lazy to write full solutions using them. But just small draft:

with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select 
  str,
  case 
     when regexp_like(str, '[^'||chr(1)||'-'||chr(127)||']') 
       then 'Ok' 
       else 'Not ok' 
  end as res,
  dump(str,1016) dmp,
  dump(str,1015) dmp,
  utl_raw.cast_to_raw(str) as_row,
  regexp_count(dump(str,1016)||',', '[89a-f][0-9a-f],') xs
from t;

Results:

STR        RES    DMP                                                                 DMP                                                                     AS_ROW               XS
---------- ------ ------------------------------------------------------------------- ----------------------------------------------------------------------- -------------------- --
why Ä ?    Ok     Typ=1 Len=8 CharacterSet=AL32UTF8: 77,68,79,20,c3,84,20,3f          Typ=1 Len=8 CharacterSet=AL32UTF8: 119,104,121,32,195,132,32,63         77687920C384203F      2
why - ?    Not ok Typ=1 Len=7 CharacterSet=AL32UTF8: 77,68,79,20,2d,20,3f             Typ=1 Len=7 CharacterSet=AL32UTF8: 119,104,121,32,45,32,63              776879202D203F        0
why - ? Ä  Ok     Typ=1 Len=10 CharacterSet=AL32UTF8: 77,68,79,20,2d,20,3f,20,c3,84   Typ=1 Len=10 CharacterSet=AL32UTF8: 119,104,121,32,45,32,63,32,195,132  776879202D203F20C384  2
why        Not ok Typ=1 Len=3 CharacterSet=AL32UTF8: 77,68,79                         Typ=1 Len=3 CharacterSet=AL32UTF8: 119,104,121                          776879                0

Note: as that is unicode, so the first byte >127 means that is a multibyte character, so it counts 'Ä' twice - c3,84, - both bytes are higher than 127.

Sayan Malakshinov
  • 8,492
  • 1
  • 19
  • 27