4

I have a varchar2 field containing some text, including accented characters ( e.g. à,è,...) and other non literal characters (<,!,;...).

I need to transform this field in html compliant format, by using one of the following output formats:

input     output1     output2
    á     &#225;      &aacute;  
    â     &#226;      &acirc;

Here is an example of starting data and two possible needed output, with something I unsuccessfully tried:

with test(starting, needed, needed2) as (
    select 'abc À. < ! ; à ',
           'abc &Agrave;. &lt; ! ; &agrave;',
           'abc &#192;. &#60; ! ; &#224;'
    from dual)
select starting, needed, needed2, UTL_I18N.escape_reference(starting) as result, 'UTL_I18N' as function from test union all
select starting, needed, needed2, convert(starting, 'US7ASCII' ) ,               'convert' from test union all
select starting, needed, needed2, HTF.ESCAPE_SC(starting) ,                      'htf' from test union all
select starting, needed, needed2, asciiStr(starting) ,                           'ascii' from test union all
select starting, needed, needed2, dbms_xmlgen.convert(starting) ,                'dbms_xmlgen' from test union all
select starting, needed, needed2, TRANSLATE(starting USING CHAR_CS) ,            'translate' from test;

The output (needed and obtained so far):

STARTING             NEEDED                                   NEEDED2                                  RESULT                         FUNCTION
-------------------- ---------------------------------------- ---------------------------------------- ------------------------------ -----------
abc À. < ! ; à       abc &Agrave;. &lt; ! ; &agrave;          abc &#192;. &#60; ! ; &#224;             abc À. &lt; ! ; à              UTL_I18N
abc À. < ! ; à       abc &Agrave;. &lt; ! ; &agrave;          abc &#192;. &#60; ! ; &#224;             abc ?. < ! ; ?                 convert
abc À. < ! ; à       abc &Agrave;. &lt; ! ; &agrave;          abc &#192;. &#60; ! ; &#224;             abc À. &lt; ! ; à              htf
abc À. < ! ; à       abc &Agrave;. &lt; ! ; &agrave;          abc &#192;. &#60; ! ; &#224;             abc \00B7. < ! ; \2026         ascii
abc À. < ! ; à       abc &Agrave;. &lt; ! ; &agrave;          abc &#192;. &#60; ! ; &#224;             abc À. &lt; ! ; à              dbms_xmlgen
abc À. < ! ; à       abc &Agrave;. &lt; ! ; &agrave;          abc &#192;. &#60; ! ; &#224;             abc À. < ! ; à                 translate

The resulting string is used to build a file compliant with Excel (even Excel 2003), by applying a transformation on a XML; this does not support accented characters, so I need a conversion.

I could use some regular expressions, but I was trying to get a better solution. I'm using Oracle 11.2.0.3.0.

Aleksej
  • 22,443
  • 5
  • 33
  • 38
  • Is a function to parse the string and do a replace just too old fashioned? – kevinskio Apr 01 '16 at 12:59
  • 1
    I was looking for some built-in, to avoid handling the possible characters (à,è,é,<,>..) by code – Aleksej Apr 01 '16 at 13:13
  • 1
    Personally, I'd convert my columns to NVARCHAR, and output the data as UTF-8 for the web page. As you can tell from looking at the source to your own question on Stack Overflow, modern web pages cope fine with characters like "è" in the text of a web page, as long as the data's character encoding is handled properly from the database to the page. The only characters you need to worry about then (for HTML body text) are ampersands and angle brackets. – Matt Gibson Apr 01 '16 at 13:27
  • @MattGibson : You are right. Sorry, I missed something in the question. I need this to build an Excel file by applying a transformation on a XML string, and this does not support "è" and similar characters. Editing the question to better explain – Aleksej Apr 01 '16 at 13:32
  • 1
    You might have better luck extracting the data as an XML document, then running it thru a simple XSLT to recode in US-ASCII, which escapes all non-ASCII characters. – Stavr00 Apr 01 '16 at 14:09
  • 1
    @MattGibson, nowadays the default character set for an Oracle Database is AL32UTF8, i.e. usually you don't need to convert any column. All unicode characters are supported even by `VARCHAR2` data types – Wernfried Domscheit Apr 02 '16 at 06:00
  • @WernfriedDomscheit Thanks! I learned something new today. – Matt Gibson Apr 02 '16 at 07:40

2 Answers2

1

As far as I know there's no built-in function to change accented characters into html entities.

But you could use a function like this one:

create or replace function to_htmlentity(text varchar2)
return varchar2 deterministic is
    result varchar(200):='';
begin 
    for letter in (
        select 
        substr(text,level,1) as character,
        ascii(substr(text,level,1)) as ascii 
        from dual connect by level <= length(text)
    )loop
        if letter.ascii > 128 
        or letter.ascii in (34,38,60,62) --",&,<,> 
        then
            result := result||'&#'||letter.ascii||';';
        else
            result := result||letter.character;
        end if;
    end loop;
    return result;
end;

This function takes any non-ascii(128) character and changes into it's respective html entity. I find it more clear than using a RegEx.

Then you can use it as follows:

with test(starting, needed, needed2) as (
    select 'abc À. < ! ; à',
           'abc &Agrave;. &lt; ! ; &agrave;',
           'abc &#192;. &#60; ! ; &#224;'
    from dual)
select starting, needed, needed2, to_htmlentity(starting) as result from test
Nefreo
  • 2,162
  • 1
  • 15
  • 24
1

I think you have to start with ASCIISTR function. For non-Unicode characters it returns the UTF-16 codepoint in Hex format, e.g. ´00B7´

Then you need a converter to transform Hex numbers into decimal numbers, for example this one:

FUNCTION Base2Dec(BaseString IN VARCHAR2, Base IN INTEGER DEFAULT 16) 
    RETURN INTEGER DETERMINISTIC IS

    BaseNumber INTEGER := 0;
    HexString CONSTANT CHAR(16) := '0123456789ABCDEF';    
BEGIN
    IF Base > 16 THEN 
        RAISE NUMERIC_OVERFLOW;
    END IF;
    IF BaseString IS NULL THEN
        RETURN NULL;
    ELSE
        FOR i IN 1..LENGTH(BaseString) LOOP
            BaseNumber := BaseNumber * Base + INSTR(HexString, UPPER(SUBSTR(BaseString, i, 1))) - 1;
        END LOOP;    
        RETURN BaseNumber;
    END IF;
END Base2Dec;

Based on this you can compose your escaped strings.

Wernfried Domscheit
  • 54,457
  • 9
  • 76
  • 110