Finding and removing Non-ASCII characters from an Oracle Varchar2

Question

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit. When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters. What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.

I would like to do :

SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')

but Oracle does not implement the [:ascii:] character class.

Is there a simple way doing what I want to do?

You'd probably want to replace `ç` by `c` etcetera. Throwing the entire character away is worse than discarding the diacritical mark. — MSalters, Feb 10 '10 at 12:02
We 1st need to find out what the characters are before deciding what to do with them. — Paul Gilfedder, Feb 10 '10 at 12:15
This is a destructive process and would you want to preserve with ascii replacements of some characters? dashes, single quotes, double quotes, etc? unistr 0013 -, 0018 ', 0019 ', 001C ", 001D " — Sun, Mar 02 '22 at 18:58

Iurii Tkachenko · Answer 1 · 2016-06-21T12:03:10.373

31

I think this will do the trick:

SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')

edited Jun 21 '16 at 12:03

answered Nov 19 '15 at 08:08

Iurii Tkachenko

3,106
29
34

8

This is neat and works well. Just as an adendum you can also use REGEXP_REPLACE(Column,'[^ -~]','') rather than all those Chr() functions and string concatenations mentioned above. – Ciarán Sep 02 '16 at 12:25
12

@Ciaran: `REGEXP_REPLACE(Column,'[^ -~]','')` is excellent, since Oracle doesn't support `'[\x80-\xFF]'`. Should be in an answer. – mivk Dec 07 '16 at 14:55
@Ciarán Works for me thanks. What does -~ actually mean? Is this standard regex or something unique to Oracle? – Learner Feb 24 '17 at 06:34
4

If you want to **keep new lines**, use `regexp_replace(column, '[^ -~|[:space:]]', '')` – Learner Feb 24 '17 at 06:34
5

It specifies an ascii character range, i.e. Space (character 32) - (to) tilda "~" (character 126) http://www.asciitable.com – Ciarán Mar 03 '17 at 08:09
Expression `[ -~]` may fail, see https://stackoverflow.com/questions/50914930/find-out-if-a-string-contains-only-ascii-characters – Wernfried Domscheit Sep 13 '19 at 06:22
1

This is a good start, but there are plenty of characters in the "print" class that are not found/removed. This definitely got me going down the right track, so thank you for adding this! – JonathanDavidArndt Oct 06 '21 at 14:15

score 26 · Answer 2 · edited Feb 10 '12 at 08:03

26

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...

UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')

...where field and table are your field and table names respectively.

edited Feb 10 '12 at 08:03

stealthyninja

10,343
11
51
59

answered Jul 29 '11 at 22:42

Robb Smith

261
3
2

2

If the length of the string is close to 4000 then `ASCIISTR()` will extend the string beyond this limit and the string will be truncated to 4000 characters (losing the excess characters from the end). [SQLFIDDLE](http://sqlfiddle.com/#!4/224ee/13) – MT0 Nov 02 '15 at 01:32
2

This picks up the backslash character as well which is not desirable as it is ascii – rtaft Jan 28 '19 at 14:11
To code around this, `where replace(asciistr(field),asciistr('\'),'\') <> field` – Juraj Jun 07 '19 at 20:18

score 11 · Answer 3 · answered Aug 14 '13 at 14:34

11

I wouldn't recommend it for production code, but it makes sense and seems to work:

SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')

answered Aug 14 '13 at 14:34

Francisco Hayoz

111
1
2

1

Note that you should normally start at 32 instead of 1, since that is the first printable ascii character. The rest are control characters, which would be weird inside text columns (even weirder than >127 I'd say). But yeah technically the answer is correct, this would detect non-ascii characters, given the original 7-bit ascii standard. – Luc Jan 05 '14 at 04:19

score 8 · Answer 4 · edited Dec 12 '12 at 20:20

8

The select may look like the following sample:

select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)  
order by nvalue;

edited Dec 12 '12 at 20:20

Yusubov

5,815
9
32
69

answered Sep 09 '10 at 21:34

Jeff Dwight

81
1
1

1

Good idea, but with this you are actually identifying fields having data where the size in bytes is not the same of the number of the symbols represented by them. – lucabelluccini Aug 12 '14 at 15:11
1

Also incorrectly returns the "\" key as a non ascii character. – Mike Samaras Aug 24 '18 at 11:09

score 7 · Accepted Answer · answered Feb 10 '10 at 11:44

7

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

answered Feb 10 '10 at 11:44

Max Shawabkeh

37,799
10
82
91

1

I tried using the hex codes as suggested however:- regexp_replace(column,'[\x00-\xFF]','') Removes nothing by the Capital letters -- do I have escape something or is there something else I need to do? – Paul Gilfedder Feb 10 '10 at 13:19
3

I run into issues using your solution. This answer has been accepted so I believe it is not outright wrong but 1.) oracle does not support the regex syntax to specify code points/characters by their hex representation (ie.'\x80'); instead you have to specify the characters themselves ( however, the regex pattern is a string expression so you may use something like `'['||chr(128)||'-'||chr(255)||']'`), 2.) trying to replace all chars in `'['||chr(32)||'-'||chr(127)||']'` results in an ora-12728 error (invalid range in regex). my db charset is al32utf8. any ideas? – collapsar Mar 27 '15 at 12:56
1

I should add that 1.) the DB is oracle 11.2.0.3.0, 2.) the ranges 32-122, 32-255 do not cause the error but 3.) applied to a string composed of mixed-case alphabet letters and digits show inverse behaviour to what you expect (ie. `REGEXP_REPLACE ( 'abc', '['||chr(32)||'-'||chr(128)||']' , '_' )` produces `abc`, while `REGEXP_REPLACE ( 'abc', '[^'||chr(32)||'-'||chr(128)||']' , '_' )` returns `___` ). – collapsar Mar 27 '15 at 13:01

score 3 · Answer 6 · answered Feb 10 '10 at 14:15

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.

Create a PLSQL function to receive your input string and return a varchar2.

In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.

That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.

score 3 · Answer 7 · answered Apr 16 '20 at 07:13

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.

I tried this from an excellent comment

'[^ -~|[:space:]]'

but got this ORA-12728: invalid range in regular expression .

but it lead me to my solution:

select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from  
    (select '-   <- strangest thing here, and I want to keep line break after
-' deta from dual ) t

displays (in my TOAD tool) as

replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)

score 3 · Answer 8 · edited Feb 10 '12 at 08:04

3

The following also works:

select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
          CONVERT (
             '3735844533120%$03  ',
             'US7ASCII',
             'WE8ISO8859P1'),
          '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
  FROM DUAL);

edited Feb 10 '12 at 08:04

stealthyninja

10,343
11
51
59

answered Jan 19 '12 at 18:32

Sajid

31
1

score 2 · Answer 9 · edited Apr 24 '12 at 03:03

2

Try the following:

-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')

-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual

edited Apr 24 '12 at 03:03

Josh Darnell

11,304
9
38
66

answered Apr 24 '12 at 02:27

Kok-Yan Lo

21
1

score 2 · Answer 10 · answered Feb 10 '10 at 22:29

I had a similar issue and blogged about it here. I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:

select dump(a,1016), a, b
from
 (select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
         COLUMN b
  from TABLE)
where a is not null
order by a;

I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.

score 2 · Answer 11 · answered Nov 22 '16 at 10:55

2

You can try something like following to search for the column containing non-ascii character :

select * from your_table where your_col <> asciistr(your_col);

answered Nov 22 '16 at 10:55

Shardul Dhanorkar

31
1
4

score 2 · Answer 12 · answered Aug 09 '11 at 20:47

I found the answer here:

http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html

CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/

Then run this to update your data

update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);

score 1 · Answer 13 · edited Apr 16 '20 at 06:56

1

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.

REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))

I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

edited Apr 16 '20 at 06:56

J. Chomel

8,193
15
41
69

answered Aug 29 '14 at 20:25

allen

31
1
3

score 0 · Answer 14 · edited Apr 23 '14 at 00:32

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.

Here is the simple test in Oracle 11.2.03

select s
     , regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
     , dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
  from (select 127+level l,chr(127+level) c from dual connect by level < 129))

And "rep 127-255" is

Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

i.e for some reason this version of Oracle does not replace char(226) and above. Using '['||chr(127)||'-'||chr(225)||']' gives the desired result. If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).

score 0 · Answer 15 · answered Oct 31 '15 at 21:51

Please note that whenever you use

regexp_like(column, '[A-Z]')

Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.

Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.

Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like

å => a
ä => a
ö => o

of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.

score 0 · Answer 16 · answered Oct 06 '21 at 14:12

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:

SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;

This will replace anything outside that printable range as a question mark.

This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

score -2 · Answer 17 · edited Aug 04 '15 at 22:31

-2

Do this, it will work.

trim(replace(ntwk_slctor_key_txt, chr(0), ''))

edited Aug 04 '15 at 22:31

josliber

43,891
12
98
133

answered Aug 04 '15 at 22:17

Mohan

1

1

Welcome to Stack Overflow! This answer turned up in the low quality review queue, presumably because you didn't explain the code. If you do explain it (in your answer), you are far more likely to get more upvotes—and the questioner is more likely to learn something! – The Guy with The Hat Aug 04 '15 at 23:22

score -3 · Answer 18 · edited Nov 06 '15 at 08:25

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:

SELECT est.clients_ref
  ,TRANSLATE (
              est.clients_ref
             ,   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
              || REPLACE (
                          TRANSLATE (
                                     est.clients_ref
                                    ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                                    ,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
                                    )
                         ,'~'
                         )
             ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
             )
      clean_ref

FROM edms_staging_table est

Finding and removing Non-ASCII characters from an Oracle Varchar2

18 Answers18

Linked

Related