35

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit. When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters. What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.

I would like to do :

SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')

but Oracle does not implement the [:ascii:] character class.

Is there a simple way doing what I want to do?

JonathanDavidArndt
  • 2,518
  • 13
  • 37
  • 49
Paul Gilfedder
  • 553
  • 1
  • 5
  • 16
  • You'd probably want to replace `ç` by `c` etcetera. Throwing the entire character away is worse than discarding the diacritical mark. – MSalters Feb 10 '10 at 12:02
  • 2
    We 1st need to find out what the characters are before deciding what to do with them. – Paul Gilfedder Feb 10 '10 at 12:15
  • This is a destructive process and would you want to preserve with ascii replacements of some characters? dashes, single quotes, double quotes, etc? unistr 0013 -, 0018 ', 0019 ', 001C ", 001D " – Sun Mar 02 '22 at 18:58

18 Answers18

31

I think this will do the trick:

SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
Iurii Tkachenko
  • 3,106
  • 29
  • 34
  • 8
    This is neat and works well. Just as an adendum you can also use REGEXP_REPLACE(Column,'[^ -~]','') rather than all those Chr() functions and string concatenations mentioned above. – Ciarán Sep 02 '16 at 12:25
  • 12
    @Ciaran: `REGEXP_REPLACE(Column,'[^ -~]','')` is excellent, since Oracle doesn't support `'[\x80-\xFF]'`. Should be in an answer. – mivk Dec 07 '16 at 14:55
  • @Ciarán Works for me thanks. What does -~ actually mean? Is this standard regex or something unique to Oracle? – Learner Feb 24 '17 at 06:34
  • 4
    If you want to **keep new lines**, use `regexp_replace(column, '[^ -~|[:space:]]', '')` – Learner Feb 24 '17 at 06:34
  • 5
    It specifies an ascii character range, i.e. Space (character 32) - (to) tilda "~" (character 126) http://www.asciitable.com – Ciarán Mar 03 '17 at 08:09
  • Expression `[ -~]` may fail, see https://stackoverflow.com/questions/50914930/find-out-if-a-string-contains-only-ascii-characters – Wernfried Domscheit Sep 13 '19 at 06:22
  • 1
    This is a good start, but there are plenty of characters in the "print" class that are not found/removed. This definitely got me going down the right track, so thank you for adding this! – JonathanDavidArndt Oct 06 '21 at 14:15
26

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...

UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')

...where field and table are your field and table names respectively.

stealthyninja
  • 10,343
  • 11
  • 51
  • 59
Robb Smith
  • 261
  • 3
  • 2
  • 2
    If the length of the string is close to 4000 then `ASCIISTR()` will extend the string beyond this limit and the string will be truncated to 4000 characters (losing the excess characters from the end). [SQLFIDDLE](http://sqlfiddle.com/#!4/224ee/13) – MT0 Nov 02 '15 at 01:32
  • 2
    This picks up the backslash character as well which is not desirable as it is ascii – rtaft Jan 28 '19 at 14:11
  • To code around this, `where replace(asciistr(field),asciistr('\'),'\') <> field` – Juraj Jun 07 '19 at 20:18
11

I wouldn't recommend it for production code, but it makes sense and seems to work:

SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
Francisco Hayoz
  • 111
  • 1
  • 2
  • 1
    Note that you should normally start at 32 instead of 1, since that is the first printable ascii character. The rest are control characters, which would be weird inside text columns (even weirder than >127 I'd say). But yeah technically the answer is correct, this would detect non-ascii characters, given the original 7-bit ascii standard. – Luc Jan 05 '14 at 04:19
8

The select may look like the following sample:

select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)  
order by nvalue;
Yusubov
  • 5,815
  • 9
  • 32
  • 69
Jeff Dwight
  • 81
  • 1
  • 1
7

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

Max Shawabkeh
  • 37,799
  • 10
  • 82
  • 91
  • 1
    I tried using the hex codes as suggested however:- regexp_replace(column,'[\x00-\xFF]','') Removes nothing by the Capital letters -- do I have escape something or is there something else I need to do? – Paul Gilfedder Feb 10 '10 at 13:19
  • 3
    I run into issues using your solution. This answer has been accepted so I believe it is not outright wrong but 1.) oracle does not support the regex syntax to specify code points/characters by their hex representation (ie.'\x80'); instead you have to specify the characters themselves ( however, the regex pattern is a string expression so you may use something like `'['||chr(128)||'-'||chr(255)||']'`), 2.) trying to replace all chars in `'['||chr(32)||'-'||chr(127)||']'` results in an ora-12728 error (invalid range in regex). my db charset is al32utf8. any ideas? – collapsar Mar 27 '15 at 12:56
  • 1
    I should add that 1.) the DB is oracle 11.2.0.3.0, 2.) the ranges 32-122, 32-255 do not cause the error but 3.) applied to a string composed of mixed-case alphabet letters and digits show inverse behaviour to what you expect (ie. `REGEXP_REPLACE ( 'abc', '['||chr(32)||'-'||chr(128)||']' , '_' )` produces `abc`, while `REGEXP_REPLACE ( 'abc', '[^'||chr(32)||'-'||chr(128)||']' , '_' )` returns `___` ). – collapsar Mar 27 '15 at 13:01
3

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.

Create a PLSQL function to receive your input string and return a varchar2.

In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.

That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.

Jim Hudson
  • 7,959
  • 2
  • 23
  • 15
3

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.

I tried this from an excellent comment

'[^ -~|[:space:]]'

but got this ORA-12728: invalid range in regular expression .

but it lead me to my solution:

select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from  
    (select '-   <- strangest thing here, and I want to keep line break after
-' deta from dual ) t

displays (in my TOAD tool) as

in my toad tool

  • replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)
J. Chomel
  • 8,193
  • 15
  • 41
  • 69
3

The following also works:

select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
          CONVERT (
             '3735844533120%$03  ',
             'US7ASCII',
             'WE8ISO8859P1'),
          '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
  FROM DUAL);
stealthyninja
  • 10,343
  • 11
  • 51
  • 59
Sajid
  • 31
  • 1
2

Try the following:

-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')

-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
Josh Darnell
  • 11,304
  • 9
  • 38
  • 66
Kok-Yan Lo
  • 21
  • 1
2

I had a similar issue and blogged about it here. I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:

select dump(a,1016), a, b
from
 (select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
         COLUMN b
  from TABLE)
where a is not null
order by a;

I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.

Gary Myers
  • 34,963
  • 3
  • 49
  • 74
2

You can try something like following to search for the column containing non-ascii character :

select * from your_table where your_col <> asciistr(your_col);
2

I found the answer here:

http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html

CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/

Then run this to update your data

update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
1

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.

REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))

I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

J. Chomel
  • 8,193
  • 15
  • 41
  • 69
allen
  • 31
  • 1
  • 3
0

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.

Here is the simple test in Oracle 11.2.03

select s
     , regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
     , dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
  from (select 127+level l,chr(127+level) c from dual connect by level < 129))

And "rep 127-255" is

Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

i.e for some reason this version of Oracle does not replace char(226) and above. Using '['||chr(127)||'-'||chr(225)||']' gives the desired result. If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).

demongolem
  • 9,474
  • 36
  • 90
  • 105
Alex S
  • 94
  • 1
  • 1
0

Please note that whenever you use

regexp_like(column, '[A-Z]')

Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.

Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.

Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like

  • å => a
  • ä => a
  • ö => o

of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.

elwood
  • 129
  • 5
0

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:

SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;

This will replace anything outside that printable range as a question mark.

This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

JonathanDavidArndt
  • 2,518
  • 13
  • 37
  • 49
-2

Do this, it will work.

trim(replace(ntwk_slctor_key_txt, chr(0), ''))
josliber
  • 43,891
  • 12
  • 98
  • 133
Mohan
  • 1
  • 1
    Welcome to Stack Overflow! This answer turned up in the low quality review queue, presumably because you didn't explain the code. If you do explain it (in your answer), you are far more likely to get more upvotes—and the questioner is more likely to learn something! – The Guy with The Hat Aug 04 '15 at 23:22
-3

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:

SELECT est.clients_ref
  ,TRANSLATE (
              est.clients_ref
             ,   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
              || REPLACE (
                          TRANSLATE (
                                     est.clients_ref
                                    ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                                    ,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
                                    )
                         ,'~'
                         )
             ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
             )
      clean_ref

FROM edms_staging_table est

Bowdzone
  • 3,827
  • 11
  • 39
  • 52