0

I have a CSV file encoded in uft8. There is text in a field of the record

In the text are characters that are coded with two characters. e.g. in the word Österreich Character Ö is encoded as O and ¨ and not as Ö

If I open the file in the editor, then I see the correct word Österreich,

enter image description here

if I look at the file by selecting the external table, then I see O¨sterreich.

When creating the external table I already added the ACCESS Parameter CHARACTERSET AL32UTF8

ACCESS PARAMETERS (
 RECORDS DELIMITED BY NEWLINE 
 CHARACTERSET AL32UTF8 
 STRING SIZES ARE IN BYTES  
 skip 1
 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LDRTRIM 
 MISSING FIELD VALUES ARE NULL   
)

Is there a possibility or a parameter tell oracle to merge the two characters in the output to one?

hotfix
  • 3,376
  • 20
  • 36
  • According Unicode standard you should always use the shortest possible codepoint, i.e. `Ö` (`U+00D6`) instead of `O¨` (`U+004F U+0308`). What do you get when you run `SELECT DUMP(..., 1016) FROM ...`? Ensure you have character `U+0308 COMBINING DIAERESIS` rather than `U+00A8 DIAERESIS`. Most likely your client application is just not displaying it properly. Many clients have difficulties to display combined Unicode characters. – Wernfried Domscheit Mar 14 '19 at 12:14
  • @WernfriedDomscheit, The result of the query is `Typ=1 Len=11 CharacterSet=WE8MSWIN1252: 4f,a8,73,74,65,72,72,65,69,63,68` – hotfix Mar 14 '19 at 12:47
  • how can I ensure this (U + 0308 COMBINING DIAERESIS)? – hotfix Mar 14 '19 at 12:50
  • I read the data from external table in my `c#` Application and generate a pdf. Just tried to do it and the text is as mention before `O¨sterreich`and not `Österreich`. so it is not only a displaing "problem" of my client. We got the file from a customer. It's not a big problem right now, the customer is providing us with the file in a different encoding. Is only interesting, if this can be solved with external table – hotfix Mar 14 '19 at 12:57
  • Looks like in your CSV file you really have `O¨` - which are in fact two different single characters and not `Ö`. You should edit the CSV file in a way that it contains the correct content. – Wernfried Domscheit Mar 14 '19 at 12:58
  • @WernfriedDomscheit I have also seen that it is coded as two characters. It is interesting that the word is displayed correctly in the Editor(Notepad++). See screenshoot at the top. How does the editor know that it is one character and not two. I think that should be possible in the DB too – hotfix Mar 14 '19 at 13:04
  • How does it look like in a HEX editor? – Wernfried Domscheit Mar 14 '19 at 13:06
  • Österreich = `4F CC 88 73 74 65 72 72 65 69 63 68 ` , `Ö` is encoded with the first three Bytes, `O = 4F` and `¨ = CC 88 ` – hotfix Mar 14 '19 at 13:14

1 Answers1

1

Looks like Oracle external table cannot convert U+004F U+0308 properly to single Ö

I assume the result will be fine if you migrate your database from WE8MSWIN1252 to AL32UTF8 (UTF-8), see https://docs.oracle.com/database/121/NLSPG/ch11charsetmig.htm#NLSPG011

Or modify the CSV file and replace character U+004F U+0308 by U+00D6, which is called Unicode Normalization

You may also have a look at Characters and Combining Marks

Wernfried Domscheit
  • 54,457
  • 9
  • 76
  • 110