
For reference, I'm using SWI-Prolog 7.4.2 on Windows 10, 64-bit.

Entering the following code in the REPL:

write("\U0001D7F6"). % Mathematical Monospace Digit Zero

gives me this error in the output:

ERROR: Syntax error: Illegal character code
ERROR: write("
ERROR: ** here **
ERROR: \U0001D7F6") .

I know for a fact that U+1D7F6 is a valid Unicode character, so what's up?

junius
3 Answers


SWI-Prolog internally uses the C wchar_t type to represent Unicode characters. On Windows, wchar_t is 16 bits wide and intended to hold UTF-16 encoded strings. SWI-Prolog, however, uses wchar_t to get nice arrays of code points and thus effectively only supports UCS-2 on Windows (code points U+0000..U+FFFF).

On non-Windows systems, wchar_t is usually 32 bits and thus the complete Unicode range is supported.

It is not a trivial thing to fix: handling wchar_t as UTF-16 loses the nice property that each element of the array is exactly one code point, and using our own 32-bit type means we cannot use the C library's wide-character functions and have to reimplement them in SWI-Prolog. This is not only work; replacing them with plain C versions also loses the optimizations typically present in modern C runtime libraries.
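To make the limitation concrete, here is a small sketch (not SWI-Prolog's internal code, just the standard UTF-16 arithmetic) that computes the surrogate pair a 16-bit encoding would need for U+1D7F6:

% A code point above U+FFFF needs two 16-bit units in UTF-16,
% which is exactly what UCS-2 cannot express.
?- Code = 0x1D7F6,
   Offset is Code - 0x10000,
   High is 0xD800 + (Offset >> 10),    % leading (high) surrogate
   Low  is 0xDC00 + (Offset /\ 0x3FF), % trailing (low) surrogate
   format("~16r ~16r~n", [High, Low]).
% prints: d835 dff6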

Jan Wielemaker

The ISO core standard syntax for character codes looks different. The following works in SICStus Prolog, Jekejeke Prolog, SWI-Prolog, etc., and is thus more portable:

Using SWI-Prolog on a Mac:

Welcome to SWI-Prolog (threaded, 64 bits, version 7.5.8)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.

?- set_prolog_flag(double_quotes, codes).
true.

?- X = "\x1D7F6\".
X = [120822].

?- write('\x1D7F6\'), nl.
𝟶
true.

And Jekejeke Prolog on a Mac:

Jekejeke Prolog 2, Runtime Library 1.2.2
(c) 1985-2017, XLOG Technologies GmbH, Switzerland

?- X = "\x1D7F6\".
X = [120822]

?- write('\x1D7F6\'), nl.
𝟶

The underlying syntax is found in the ISO core standard, section 6.4.2.1 "hexadecimal escape sequence". It reads as follows and is shorter than the U-syntax:

hex_esc_seq --> "\x" hex_digit { hex_digit } "\".
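As a quick arithmetic check (shown here in SWI-Prolog), the code printed above is simply the numeric value of that hexadecimal escape:

?- X is 0x1D7F6.
X = 120822.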

For comparison, I get:

?- write('\U0001D7F6').
𝟶
true.

What is your environment and what do the flags say?

For example:

$ set | grep LANG
LANG=en_US.UTF-8

and also:

?- current_prolog_flag(encoding, F).
F = utf8.
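If the flag reports a more restrictive value instead (as it does for the asker), here is a hedged sketch for switching the default encoding and the interactive streams to UTF-8; set_prolog_flag/2 and set_stream/2 are the relevant SWI-Prolog predicates, but whether this actually helps depends on what the terminal itself can handle:

% Sketch: request UTF-8 as the default encoding and on the toplevel streams.
% Assumes the terminal and its font can display the character.
?- set_prolog_flag(encoding, utf8).
true.

?- set_stream(user_input,  encoding(utf8)),
   set_stream(user_output, encoding(utf8)),
   set_stream(user_error,  encoding(utf8)).
true.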
mat
  • My encoding is, evidently, "text". How do I tell Prolog to use UTF-8? – junius Aug 14 '17 at 15:29
  • See the help on [**encoding**](http://eu.swi-prolog.org/pldoc/man?section=encoding) and also [`set_stream/2`](http://eu.swi-prolog.org/pldoc/man?predicate=set_stream/2). You probably need to set some environment variables or equivalent configuration. – mat Aug 14 '17 at 15:38
  • You are using the terminal stream. – mat Aug 14 '17 at 18:57
  • How do you refer to the terminal stream then? – junius Aug 14 '17 at 19:00
  • Please see the section about [**encoding**](http://eu.swi-prolog.org/pldoc/man?section=encoding) and also [`set_stream/2`](http://eu.swi-prolog.org/pldoc/man?predicate=set_stream/2). In particular, the aliases `current_input` and `current_output` may be important for your case. It may help to use for example `?- set_stream(current_input, encoding(utf8)).` and `?- set_stream(current_output, encoding(utf8)).`, but I can only guess at this point. – mat Aug 14 '17 at 19:03
  • Okay, I tried that and it doesn't make a difference. I still get complaints about illegal character codes. – junius Aug 14 '17 at 19:07
  • It reports a _syntax error_ during _read_. If you run this code on a Unix system with a _locale_ that cannot represent this character, reading is fine but _writing_ will raise an I/O error (not on the terminal; there the value is escaped). – Jan Wielemaker Aug 16 '17 at 12:51
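A minimal sketch of the distinction described in that last comment, assuming a Unix build, the default representation_errors handling, and an illustrative file name (the explicit Latin-1 stream stands in for a locale that cannot represent the character):

% Reading/parsing the escape sequence succeeds:
?- X = '\x1D7F6\'.

% Writing the character to a stream whose encoding cannot represent it
% is what raises the error:
?- open('/tmp/zero.txt', write, S, [encoding(iso_latin_1)]),
   write(S, '\x1D7F6\'),
   close(S).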