How to read a UTF-8 character with External Table correctly

Question

I have a CSV file encoded in uft8. There is text in a field of the record

In the text are characters that are coded with two characters. e.g. in the word Österreich Character Ö is encoded as O and ¨ and not as Ö

If I open the file in the editor, then I see the correct word Österreich,

if I look at the file by selecting the external table, then I see O¨sterreich.

When creating the external table I already added the ACCESS Parameter CHARACTERSET AL32UTF8

ACCESS PARAMETERS (
 RECORDS DELIMITED BY NEWLINE 
 CHARACTERSET AL32UTF8 
 STRING SIZES ARE IN BYTES  
 skip 1
 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LDRTRIM 
 MISSING FIELD VALUES ARE NULL   
)

Is there a possibility or a parameter tell oracle to merge the two characters in the output to one?

According Unicode standard you should always use the shortest possible codepoint, i.e. `Ö` (`U+00D6`) instead of `O¨` (`U+004F U+0308`). What do you get when you run `SELECT DUMP(..., 1016) FROM ...`? Ensure you have character `U+0308 COMBINING DIAERESIS` rather than `U+00A8 DIAERESIS`. Most likely your client application is just not displaying it properly. Many clients have difficulties to display combined Unicode characters. — Wernfried Domscheit, Mar 14 '19 at 12:14
@WernfriedDomscheit, The result of the query is `Typ=1 Len=11 CharacterSet=WE8MSWIN1252: 4f,a8,73,74,65,72,72,65,69,63,68` — hotfix, Mar 14 '19 at 12:47
I read the data from external table in my `c#` Application and generate a pdf. Just tried to do it and the text is as mention before `O¨sterreich`and not `Österreich`. so it is not only a displaing "problem" of my client. We got the file from a customer. It's not a big problem right now, the customer is providing us with the file in a different encoding. Is only interesting, if this can be solved with external table — hotfix, Mar 14 '19 at 12:57
Looks like in your CSV file you really have `O¨` - which are in fact two different single characters and not `Ö`. You should edit the CSV file in a way that it contains the correct content. — Wernfried Domscheit, Mar 14 '19 at 12:58
@WernfriedDomscheit I have also seen that it is coded as two characters. It is interesting that the word is displayed correctly in the Editor(Notepad++). See screenshoot at the top. How does the editor know that it is one character and not two. I think that should be possible in the DB too — hotfix, Mar 14 '19 at 13:04
Österreich = `4F CC 88 73 74 65 72 72 65 69 63 68 ` , `Ö` is encoded with the first three Bytes, `O = 4F` and `¨ = CC 88 ` — hotfix, Mar 14 '19 at 13:14

score 1 · Answer 1 · answered Mar 14 '19 at 13:48

Looks like Oracle external table cannot convert U+004F U+0308 properly to single Ö

I assume the result will be fine if you migrate your database from WE8MSWIN1252 to AL32UTF8 (UTF-8), see https://docs.oracle.com/database/121/NLSPG/ch11charsetmig.htm#NLSPG011

Or modify the CSV file and replace character U+004F U+0308 by U+00D6, which is called Unicode Normalization

You may also have a look at Characters and Combining Marks

How to read a UTF-8 character with External Table correctly

1 Answers1