4

I am trying to store some text (e.g. č) in a Postgres database, however when retrieving this value, it appears on screen as ?. I'm not sure why it does this, I was under the impression that it was a character that wasn't supported in UTF-8, but was in UTF-8, however, judging by the first answer, this is an incorrect assumption.

Original question (which may still be valid):

I have read about UTF-8 Surrogate pairs, which may achieve what I require, and I've seen a few examples involving the stringinfo object TextElementEnumerators, but I couldn't work out a practical proof of concept.

Can someone provide an example of how you would write and read UTF-16 (probably using this surrogate pair concept) to a postgres database. Thank you.

Updated question: Why would the č character be returned from the database as a question mark?

We use NPGSQL to access the database and VB.Net.

Mr Shoubs
  • 14,629
  • 17
  • 68
  • 107
  • Use a different tool, like PgAdmin III, to see whether the text is inserted correctly (in which case PgAdmin III will see it fine) or mangled on insert.That'll help you figure out whether your bug is reading the text out, or inserting it in the first place. – Craig Ringer Dec 10 '11 at 05:22
  • Also, if you're chopping and changing UTF-8 text, check your code for assumptions that 1 byte = 1 character, because that's *not* valid for UTF-8. Ditto 2 bytes = 1 character assumptions for UTF-16. Breaking up a UTF-8 extended character will cause all sorts of funky results. – Craig Ringer Dec 10 '11 at 05:23
  • Apologies. Have removed the my answer. Was using two SQL variants at the time I ran into something similar. Must have been the other one. – Philip Couling Dec 10 '11 at 10:52

3 Answers3

9

There's no such thing as a character which exists in UTF-16 but not UTF-8. Both are capable of encoding all of Unicode. In other words, if you can get UTF-8 to work, it should be able to store any valid Unicode text.

EDIT: Surrogate pairs are actually a feature of UTF-16 rather than UTF-8. They allow a character which isn't in the basic multi-lingual plane (BMP) to be represented as two UTF-16 code units. Basically, UTF-16 is often treated as a fixed-width encoding (exactly two bytes per Unicode character) but that only allows the BMP to be encoded cleanly. Surrogate pairs are a (fairly hacky) way of extending the range beyond the BMP.

I very much doubt that the character you're trying to represent is outside the BMP, so I suspect you need to look elsewhere for the problem. In particular, it's worth dumping the exact character values of the text (e.g. by casting each char to int) before it goes into the database and after you've fetched it. Ideally, do this in a short but complete console app.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • My knowledge isn't vast on this, so it sounds like I am wrong - I will update the question and remove my assumption. – Mr Shoubs Dec 09 '11 at 16:36
  • Thanks Jon, you've pointed me in the right direction, this could be down to the version of Postgres we are using or our clustering software. I wrote a completely local test and didn't have the same problem that exists in our production system. – Mr Shoubs Dec 09 '11 at 17:00
  • 1
    @MrShoubs: That may mean that the problem isn't where you think it is though - are you able to run your test program against your production system to verify that it fails there? (Glad the advice is helping though.) – Jon Skeet Dec 09 '11 at 17:12
1

How can I store all UTF-16 "characters" in a Postgres database?

Short answer, this is not directly possible as PostgreSQL only supports a UTF-8 character set.

UTF-16 based formats like Java, JavaScript, Windows can contain half surrogate pairs which have no representation in UTF-8 or UTF-32. These may easily be created by sub-stringing a Java, JavaScript, VB.Net string. As they cannot be represented in UTF-8 or UTF-32 and thus cannot be stored in a database which only supports an UTF-8 character set like PostgreSQL.

Windows Path names may contain half surrogate pairs which cannot be read as utf-8 ( https://github.com/rust-lang/rust/issues/12056 ).

One would have to use database system which supports a UTF-16/CESU-8 character set which is more adapted to Java/Android, JavaScript/NodeJS, .Net/wchar_t/Windows languages/platforms. (SQLServer, Oracle (UTF-8 collation), DB2, Informix, HANA, SQL Anywhere, MaxDB typically support such a charset.

Note that with emoticons being represented as unicode codepoints outside the Basic Multilingual Plane these differences will become more relevant also for western users.

On postgres you may: a) Accept the losses, b) Store the data as binary data or c) translate them to an encoded representation (e.g. the JSON rfc encodes them as two escaped characters to be able to transport half surrogates within an UTF-8/Ascii based network format without loss (https://www.rfc-editor.org/rfc/rfc4627 Section 2.5).

With e.g. emoticons being located outside the Basic multilingual plane this problem will become more relevant also in the western world.

Depending on the pick of language Application Server ( Java,Scala, C#/Windows, JavaScript/NodeJS) vs go and the level of investment into language support (using e.g. ICU string splitting functions at grapheme boundaries (https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) instead of simple truncation the issue may be less relevant. But the majority of enterprise systems and languages fall in the UTF-16 camp today, with software using a simple sub-string operations.

Community
  • 1
  • 1
user6649841
  • 296
  • 2
  • 5
0

As to the problem storing/retrieving č

  1. Check the character set the Postgre db is running on is UTF-8 character set (https://www.postgresql.org/docs/9.1/multibyte.html ) or a character set which can represent the character.

  2. Check that the client connection to the database is set up to perform the appropriate codepage conversion ( for VB.Net this would be from UTF-16LE to UTF-8 or the database charset, this is typically a parameter on the connection string (charset) ).

  3. Check that the input is the actual UTF-8 / UTF-16 in VB.net byte sequence, not the Windows-1250 byte sequence.

  4. Check that this is not simply a limitation of the output tool or console (e.g. a Windows console typically does not display unicode characters but uses Windows-12xx character set (one can try https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8) but typically inspecting the byte sequence in a VB.Net debugger is best.

  5. Check that the length of the CHAR/VARCHAR column is sufficient to store your representation, even if represented in NFKD decomposition.

The grapheme you indicate has several different unicode representations.

 U+010D LATIN SMALL LETTER C WITH CARON
 U+0063 LATIN SMALL LETTER c followed by U+030C COMBINING CARON

And a different representations other character sets (e.g. 0xE8 in ISO-8859-2/Windows-1250 (https://en.wikipedia.org/wiki/Windows-1250) or ISO-8859-13 /Windows-1257.

All unicode representations fall into the basic multilingual plane, so the UTF-16 surrogate issue with postgre as indicated in the question title and answered below is likely irrelevant to your problem.

user6649841
  • 296
  • 2
  • 5