4

I have a field in a table which holds XML entities for special characters, since the table is in latin-1. E.g. "Hallöle slovenčina" (the "ö" is in latin-1, but the "č" in "slovenčina" had to be converted to an entity by some application that stores the values into the database)

Now I need to export the table into a utf-8 encoded file by converting the XML entities to their original characters.

Is there a function in Oracle that might handle this for me, or do I really need to create a huge key/value map for that?

Any help is greatly appreciated.

EDIT: I found the function DBMS_XMLGEN.convert, but it only works on <,> and &. Not on &#NNN; :-(

mawimawi
  • 4,222
  • 3
  • 33
  • 52
  • in [this](http://www.pilcrow.nl/2011/04/unescape-html-entities-in-oracle) link dbms_xmlgen.convert converted the "'". Maybe because is a xml table instead of varchar2? –  Feb 06 '12 at 23:46
  • Sérgio, it is definitely a varchar2 field: CREATE TABLE mytable ( tid INTEGER NOT NULL , zitat VARCHAR2 (2000) NOT NULL ); – mawimawi Feb 07 '12 at 17:26

3 Answers3

7

You can also just use the internationalization package :

UTL_I18N.unescape_reference ('text')

Works great in changing those html entities to normal characters (such as cleanup after moving a database from iso 8859P1 to UTF-8)

Stephan
  • 41,764
  • 65
  • 238
  • 329
Olafur Tryggvason
  • 4,761
  • 24
  • 30
7

I believe the problem with dbms_xmlgen is that there are technically only five XML entities. Your example has a numeric HTML entity, which corresponds with Unicode:

http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=0100&end=017F

Oracle has a function UNISTR which is helpful here:

select unistr('sloven\010dina') from dual;

I've converted 269 to its hex equivalent 010d in the example above (in Unicode it is U+010D). However, you could pass a decimal number and do a conversion like this:

select unistr('sloven\' || replace(to_char(269, 'xxx'), ' ', '0') || 'ina') from dual;

EDIT: The PL/SQL solution:

Here's an example I've rigged up for you. This should loop over and replace any occurrences for each row you select out of your table(s).

create table html_entities (
    id NUMBER(3),
    text_row VARCHAR2(100)
);

INSERT INTO html_entities 
VALUES (1, 'Hallöle sloven&#269;ina &#266; &#250;');

INSERT INTO html_entities 
VALUES (2, 'I like the letter &#266;');

INSERT INTO html_entities 
VALUES (3, 'Nothing to change here.');

DECLARE
    v_replace_str NVARCHAR2(1000);
    v_fh UTL_FILE.FILE_TYPE;       
BEGIN
    --v_fh := utl_file.fopen_nchar(LOCATION IN VARCHAR2, FILENAME IN VARCHAR2, OPEN_MODE IN VARCHAR2, MAX_LINESIZE IN BINARY_INTEGER);

    FOR v_rec IN (select id, text_row from html_entities) LOOP
        v_replace_str := v_rec.text_row;
        WHILE (REGEXP_INSTR(v_replace_str, '&#[0-9]+;') <> 0) LOOP
            v_replace_str := REGEXP_REPLACE(
                v_replace_str, 
                '&#([0-9]+);',
                unistr('\' || replace(to_char(to_number(regexp_replace(v_replace_str, '.*?&#([0-9]+);.*$', '\1')), 'xxx'), ' ', '0')),
                1,
                1
            );
        END LOOP;

        -- utl_file.put_line_nchar(v_fh, v_replace_str);
        dbms_output.put_line(v_replace_str);

    END LOOP;
    --utl_file.fclose(v_fh);
END;
/

Notice that I've stubbed in calls to the UTL_FILE function to write NVARCHAR lines (Oracle's extended character set) to a file on the database server. The dbms_output, while great for debugging, doesn't seem to support extended characters, but this shouldn't be a problem if you use UTL_FILE to write to a file. Here's the DBMS_OUTPUT:

Hallöle slovencina C ú
I like the letter C
Nothing to change here.
Dan A.
  • 2,914
  • 18
  • 23
  • "unistr" looks like an interesting function, but since I have many rows in my table, with entities for many different characters, all of them in the format N; (where N stands for one or more decimal digits), this approach does not look feasible to me. – mawimawi Feb 07 '12 at 17:06
  • @mawimawi - I realize this isn't a full solution, but it's a good start. I see that MK has expanded on my solution with regular expressions. If `REGEXP_REPLACE` does not have a global replace function (it would be a shame if there isn't), you could use a stored procedure - loop through a cursor, and on each row loop until you no longer find a decimal entity using the `REGEXP_INSTR` function as well as `REGEXP_REPLACE`. If I have time later and you still do not have a solution, I could write a code sample. – Dan A. Feb 07 '12 at 17:48
  • if you find the time, that will be a godsend to me. – mawimawi Feb 08 '12 at 16:30
  • Thank you for your help Danimal37! Looks fantastic! – mawimawi Feb 08 '12 at 19:56
  • And thank you! My humble reputation has just taken a big jump! :) – Dan A. Feb 08 '12 at 20:16
  • With **read-only** rights I am trying to run this regex-replace exercise on a `SELECT {reg-ex...} as "corrected values" FROM ... ` as a one-liner, that is, I am *not* trying to change the values in the source table, as it were, but only in the outputted table. On a first attempt, this returned an "`invalid number`" error. Is what I'm asking clear enough and easily done? – nutty about natty May 25 '20 at 12:29
2

This should probably be done in PL/SQL which I do not know, but I wanted to see how far I could get it with pure SQL. This only replaces the first occurence of the code, so you would have to somehow run it multiple times.

select regexp_replace(s, '&#([0-9]+);', u) from
(select s, unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(c), 'xxxx'), ' ', '')) u from
(select s, regexp_replace(s, '.*&#([0-9]+);.*', '\1') c from
(select 'Hallöle sloven&#269;ina' s from dual)))

Or less readable but more usable:

SELECT 
REGEXP_REPLACE(s, '&#([0-9]+);', unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(regexp_replace(s, '.*?&#([0-9]+);.*$', '\1', 1, 1)), 'xxxx'), ' ', '')), 1, 1) 
FROM
(SELECT 'Hallöle sloven&#269;ina &#269; &#278;' s FROM DUAL)

This (updated) version correctly replaces the first occurrence. You need to apply it until all of them are replaced.

MK.
  • 33,605
  • 18
  • 74
  • 111
  • looks great! but isn't there a flag for the regexp_replace function that does the replacment multiple times in a string? I am pretty sure that perl, python and other languages support such a flag. otherwise I have to replace many many times for phrases that are e.g. in polish or slovak languages :-( – mawimawi Feb 07 '12 at 17:25
  • Yes, it can replace all, but the problem is that you are not replacing them with the same string every time. You need to extract code, convert it to unicode and replace that code with the unicode character. It is probably possible to do that with pure SQL somehow but it is definitely beyond my level of SQL expertise. – MK. Feb 07 '12 at 17:55
  • Replacing "269" with "160" returns the `must be followed by four hexadecimal characters` error](https://oracleerror.com/2016/07/30/ora-30186-must-be-followed-by-four-hexadecimal-characters-or-another/) (in 11g). Any idea why and how to resolve? My XML strings contain many of these ` `, amongst others. Thanks! – nutty about natty May 25 '20 at 13:23