In furtherance to question Perl + Curses: Expecting a UTF-8 encoded multibyte character from getchar(), but not getting any, I'm starting a separate question after more info has been collected:
I am trying out Bryan Henderson's Perl interface to the ncurses library: Curses, henceforth called "curses shim".
- Tried with perl-Curses-1.36-9.fc30.x86_64
- Tried with perl-Curses-1.36-11.fc31.x86_64
For a simple exercise, I try to obtain single characters typed on-screen. This is directly based off the NCURSES Programming HOWTO, with adaptations.
When I call the curses shim's getchar()
, I expect to receive a character, possibly multibyte (It's a bit more complicated as explained in this part of the library manpage because one has to handle the special cases of function keys and no input, but that's just the usual curlicues).
I'm receiving the correct input. Mostly. This took me some time to actually ascertain. I have experimented quite a bit with the code, been confused from time to time, and also fought with Perl's it's-an-opaque-object-but-not-quite approach to Strings (actually, byte containers).
In the end:
Looks like there is at least one bug in the curses shim:
Strings from curses with a single character between 0x80
and 0xFF
and not created correctly by the wide-character aware getchar()
. There are additional NUL bytes in there, which confuses Perl, and which also confuse the curses printw
function.
Let's see what we have got:
An overview of what we receiving characters q
, è
, д
through Curses::getchar()
which transform the win_t
2-byte wide character obtained from wget_wch()
into a Perl string:
- The "green case" works out.
- The "orange case" fails. But there is (fragile) fix.
- The "yellow case" works out, but probabyl accidentally.
The curses shim function printw
is simple and simulates a printf
:
In Curses.pm:
sub printw { addstr(sprintf shift, @_) }
And this indicates another bug in the curses shim: addstr
is the C curses library function for non-wide character processing. I don't know how the calling convention from Perl to C above is solved, but for me that line means that the string generated by sprintf
must be necessarily UTF-8 encoded so that printing through addstr
works on an UTF-8 enabled terminal. And it will work completely by accident. If Perl had UCS-2 encoded string, it wouldn't. The correct way to output wide chars is addwstr
.
Here is some code.
Run as:
perl test.pl 2> OUT; reset
To see the problem in action. Enter p
, è
or д
.
Run as:
perl test.pl --fixit 2> OUT; reset
To see the code in action which fixes the string with the character between 0x80
and 0xFF
.
Tail the file OUT
while this is running:
tail -f OUT
Also noted:
- Contrary to what perldoc say,
use bytes
does not only apply to its scope. It break out of it and changes behaviour oford
program-wide.
CODE
If you try it, hit BACKSPACE to get out of the loop, because CTRL-C is no longer interpreted.
The terminal must be in UTF-8 mode.
#!/usr/bin/perl
use warnings;
use strict;
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
use Curses; # On Fedora: dnf install perl-Curses
use Encode qw(decode encode); # https://perldoc.perl.org/Encode.html
# Necessary for proper printing to a terminal expecting UTF-8
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
# https://perldoc.perl.org/perllocale.html#The-setlocale-function
use POSIX ();
my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");
# ---
# Set STDOUT/STDERR to autoflush for proper debugging
# ---
select STDERR;
$| = 1;
select STDOUT;
$| = 1;
# ---
# Surrounds the actual program: set upand teardown curses
# ---
sub setup() {
initscr();
raw();
keypad(1);
noecho();
}
sub teardown {
endwin();
}
# ===
# Functions below is where the action is
# ===
sub announce {
my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
return { refresh => 1 }
}
sub read_one_char {
# Read a next character, waiting until it is there.
# >>>>
my ($ch, $key) = Curses::getchar();
# <<<<
if (defined $key) {
# it's a function key
printw "Function key pressed: $key";
printw "\n";
# we are done if "backspace" was hit
return { done => ($key == KEY_BACKSPACE()) }
}
elsif (!defined $ch) {
# it's an error
printw "getchar() failed\n";
return {}
}
else {
output_with_fix($ch);
return { ch => $ch }
}
}
sub output_with_fix {
my($str) = @_;
print STDERR "Received new character\n";
print STDERR " About: " . about($str) . "\n";
my $curses_str = $str; # will go to curses via curses printw()
my $stderr_str = $str; # will go to stderr via print()
my $fixit = defined($ARGV[0]) && (lc($ARGV[0]) eq "--fixit");
my $fakeit = defined($ARGV[0]) && (lc($ARGV[0]) eq "--fakeit");
if ($fixit || $fakeit) {
my $byte_len;
my $byte_ord;
{
# https://perldoc.perl.org/bytes.html
use bytes;
$byte_len = length($str);
$byte_ord = ord($str);
}
if ($byte_ord >= hex('0x80') && $byte_len == 2) {
print STDERR " UTF-8 (on 2 bytes)? NOT Fixing it!\n";
}
elsif ($byte_ord >= hex('0x80') && $byte_len == 1) {
print STDERR " ISO-8859-1? Fixing it!\n";
if ($fixit) {
# Try to "decode it as ISO-8859-1" and hope that Perl fixes whatever is wrong
# The result is a nice string internally encoded as UTF-8, with its UTF-8 flag properly set
$curses_str = decode('iso-8859-1',$str);
}
else {
# Fake it to show the above is not a NOP
$curses_str = $str;
}
print STDERR " Fixed curses string to: " . about($curses_str) . "\n";
}
elsif ($byte_ord < hex('0x80') && $byte_len == 1) {
print STDERR " ASCII: Nothing to fix\n";
}
else {
die "Something weird!"
}
}
print STDERR " Interpolated string : $stderr_str\n";
print STDERR " Concatenated string : " . $stderr_str . "\n";
print STDERR " Sprintf-built string : " . sprintf("%s",$stderr_str) . "\n";
printw("...printw printing interpolated string : $curses_str\n");
printw("...printw printing concatenated string : " . $curses_str . "\n");
printw("...printw printing sprintf-built string : " . sprintf("%s",$curses_str) . "\n");
}
sub feedback {
my ($ch) = @_;
printw "The pressed key is: ";
attron(A_BOLD);
#
printw($ch);
#
attroff(A_BOLD);
printw("\n");
return { refresh => 1 } # should refresh
}
# ---
# main
# ---
sub do_curses_run {
setup;
printw "Locale: $loc\n"; # printing OK
my $done = 0;
while (!$done) {
my $bubl = announce();
refresh() if $$bubl{refresh};
$bubl = read_one_char();
$done = $$bubl{done};
if (!$done && defined $$bubl{ch}) {
$bubl = feedback($$bubl{ch});
refresh() if $$bubl{refresh};
}
}
teardown;
}
do_curses_run();
# ===
# ANNEX
# ===
# ---
# annex stuff: printing info about string
# ---
sub about {
my($str) = @_;
my $char_len = length($str);
my $ord = ord($str);
my $mark = (utf8::is_utf8($str) ? "yes" : "no");
my $byte_ord;
my $byte_len;
my $byte_contents;
{
# https://perldoc.perl.org/bytes.html
use bytes;
$byte_len = length($str);
$byte_ord = ord($str);
$byte_contents = sprintf("%vx", $str);
}
my $f1 = sprintf("%d bytes", $byte_len);
my $f2 = sprintf(", %d chars", $char_len);
my $f3 = sprintf(", utf-8: %s",$mark);
my $f4 = sprintf(", ord: x%x", $ord);
my $f4p = sprintf(", byte-ord: x%x", $byte_ord);
my $f5 = sprintf(", byte-contents (hex, via %%vx): %s", $byte_contents);
return $f1 . $f2 . $f3 . $f4 . $f4p . $f5;
}
Output for normal operations
Run as perl test.pl 2> OUT
On the terminal, è
is not displayed:
Locale: en_GB.UTF-8
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string : p
...printw printing concatenated string : p
...printw printing sprintf-built string : p
The pressed key is: p
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string :
...printw printing concatenated string :
...printw printing sprintf-built string :
The pressed key is:
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string : д
...printw printing concatenated string : д
...printw printing sprintf-built string : д
The pressed key is: д
Type any character to see it in bold! (or backspace to exit)
In OUT
:
The lines marked with #
are coming from the C-side of curses shim, where I added some printfs:
# Obtained win_t 0x0070
# Not UTF-8 string: 70 00
Received new character
About: 1 bytes, 1 chars, utf-8: no, ord: x70, byte-ord: x70, byte-contents (hex, via %vx): 70
Interpolated string : p
Concatenated string : p
Sprintf-built string : p
# Obtained win_t 0x00e8
# Not UTF-8 string: e8 00
Received new character
About: 1 bytes, 1 chars, utf-8: no, ord: xe8, byte-ord: xe8, byte-contents (hex, via %vx): e8
Interpolated string : è
Concatenated string : è
Sprintf-built string : è
# Obtained win_t 0x0434
# UTF-8 string: d0 b4 00
Received new character
About: 2 bytes, 1 chars, utf-8: yes, ord: x434, byte-ord: xd0, byte-contents (hex, via %vx): d0.b4
Interpolated string : д
Concatenated string : д
Sprintf-built string : д
# Obtained win_t 0x0107
Output for "fixit" operations
Run as perl test.pl --fixit 2> OUT
On the terminal, è
is now displayed.
Locale: en_GB.UTF-8
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string : p
...printw printing concatenated string : p
...printw printing sprintf-built string : p
The pressed key is: p
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string : è
...printw printing concatenated string : è
...printw printing sprintf-built string : è
The pressed key is:
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string : д
...printw printing concatenated string : д
...printw printing sprintf-built string : д
The pressed key is: д
Type any character to see it in bold! (or backspace to exit)
In OUT
:
The lines marked with #
are coming from the C-side of curses shim, where I added some printfs:
# Obtained win_t 0x0070
# Not UTF-8 string: 70 00
Received new character
About: 1 bytes, 1 chars, utf-8: no, ord: x70, byte-ord: x70, byte-contents (hex, via %vx): 70
ASCII: Nothing to fix
Interpolated string : p
Concatenated string : p
Sprintf-built string : p
# Obtained win_t 0x00e8
# Not UTF-8 string: e8 00
Received new character
About: 1 bytes, 1 chars, utf-8: no, ord: xe8, byte-ord: xe8, byte-contents (hex, via %vx): e8
ISO-8859-1? Fixing it!
Fixed curses string to: 2 bytes, 1 chars, utf-8: yes, ord: xe8, byte-ord: xc3, byte-contents (hex, via %vx): c3.a8
Interpolated string : è
Concatenated string : è
Sprintf-built string : è
# Obtained win_t 0x0434
# UTF-8 string: d0 b4 00
Received new character
About: 2 bytes, 1 chars, utf-8: yes, ord: x434, byte-ord: xd0, byte-contents (hex, via %vx): d0.b4
UTF-8 (on 2 bytes)? NOT Fixing it!
Interpolated string : д
Concatenated string : д
Sprintf-built string : д
# Obtained win_t 0x0107
Addendum
Of course, you need libcursesw
.
If you have the Perl process pid, try:
pmap -p PID
which shows the per process having attached:
/usr/lib64/libncursesw.so.6.1
/usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so
Addendum 2
Tried to see what the C code does and added an fprintf
directly into the multibyte handling code of curses/Curses-1.36/CursesFunWide.c
, recompiled, didn't manage to override the system Curses.so
with my own via LD_LIBRARY_PATH
(why not? why is everything only working half of the time?), so replaced the system library directly in place (take THAT!).
#ifdef C_GET_WCH
wint_t wch;
int ret = wget_wch(win, &wch);
if (ret == OK) {
ST(0) = sv_newmortal();
fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
c_wchar2sv(ST(0), wch);
XSRETURN(1);
} else if (ret == KEY_CODE_YES) {
XST_mUNDEF(0);
ST(1) = sv_newmortal();
sv_setiv(ST(1), (IV)wch);
XSRETURN(2);
} else {
XSRETURN_UNDEF;
}
#else
Addendum 3
The win_t
(apparently the same as wchar_t
) conversion code from CursesWide.c
, converts the wint_t
(here seen as wchar_t
) received from wget_wch()
into a Perl string. SV
is the "scalar value" type.
See also: https://perldoc.perl.org/perlguts.html
Here with two fprintf
inserted to see what is going on:
static void
c_wchar2sv(SV * const sv,
wchar_t const wc) {
/*----------------------------------------------------------------------------
Set SV to a one-character (not -byte!) Perl string holding a given wide
character
-----------------------------------------------------------------------------*/
if (wc <= 0xff) {
char s[] = { wc, 0 };
fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
sv_setpv(sv, s);
SvPOK_on(sv);
SvUTF8_off(sv);
} else {
char s[UTF8_MAXBYTES + 1] = { 0 };
char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
*s_end = 0;
fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
sv_setpv(sv, s);
SvPOK_on(sv);
SvUTF8_on(sv);
}
}