Are there one (or two) solid bugs in the `curses` shim for Perl?

Question

In furtherance to question Perl + Curses: Expecting a UTF-8 encoded multibyte character from getchar(), but not getting any, I'm starting a separate question after more info has been collected:

I am trying out Bryan Henderson's Perl interface to the ncurses library: Curses, henceforth called "curses shim".

Tried with perl-Curses-1.36-9.fc30.x86_64
Tried with perl-Curses-1.36-11.fc31.x86_64

For a simple exercise, I try to obtain single characters typed on-screen. This is directly based off the NCURSES Programming HOWTO, with adaptations.

When I call the curses shim's getchar(), I expect to receive a character, possibly multibyte (It's a bit more complicated as explained in this part of the library manpage because one has to handle the special cases of function keys and no input, but that's just the usual curlicues).

I'm receiving the correct input. Mostly. This took me some time to actually ascertain. I have experimented quite a bit with the code, been confused from time to time, and also fought with Perl's it's-an-opaque-object-but-not-quite approach to Strings (actually, byte containers).

In the end:

Looks like there is at least one bug in the curses shim:

Strings from curses with a single character between 0x80 and 0xFF and not created correctly by the wide-character aware getchar(). There are additional NUL bytes in there, which confuses Perl, and which also confuse the curses printw function.

Let's see what we have got:

An overview of what we receiving characters q, è, д through Curses::getchar() which transform the win_t 2-byte wide character obtained from wget_wch() into a Perl string:

The "green case" works out.
The "orange case" fails. But there is (fragile) fix.
The "yellow case" works out, but probabyl accidentally.

The curses shim function printw is simple and simulates a printf:

In Curses.pm:

sub printw   { addstr(sprintf shift, @_) }

And this indicates another bug in the curses shim: addstr is the C curses library function for non-wide character processing. I don't know how the calling convention from Perl to C above is solved, but for me that line means that the string generated by sprintf must be necessarily UTF-8 encoded so that printing through addstr works on an UTF-8 enabled terminal. And it will work completely by accident. If Perl had UCS-2 encoded string, it wouldn't. The correct way to output wide chars is addwstr.

Here is some code.

Run as:

perl test.pl 2> OUT; reset

To see the problem in action. Enter p, è or д.

Run as:

perl test.pl --fixit  2> OUT; reset

To see the code in action which fixes the string with the character between 0x80 and 0xFF.

Tail the file OUT while this is running:

tail -f OUT

Also noted:

Contrary to what perldoc say, use bytes does not only apply to its scope. It break out of it and changes behaviour of ord program-wide.

CODE

If you try it, hit BACKSPACE to get out of the loop, because CTRL-C is no longer interpreted.

The terminal must be in UTF-8 mode.

#!/usr/bin/perl

use warnings;
use strict;
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"

use Curses;                   # On Fedora: dnf install perl-Curses
use Encode qw(decode encode); # https://perldoc.perl.org/Encode.html

# Necessary for proper printing to a terminal expecting UTF-8
# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

# https://perldoc.perl.org/perllocale.html#The-setlocale-function

use POSIX ();
my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");

# ---
# Set STDOUT/STDERR to autoflush for proper debugging
# ---

select STDERR;
$| = 1;
select STDOUT;
$| = 1;

# ---
# Surrounds the actual program: set upand teardown curses
# ---

sub setup() {
   initscr();
   raw();
   keypad(1);
   noecho();
}

sub teardown {
   endwin();
}

# ===
# Functions below is where the action is
# ===

sub announce {
   my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
   return { refresh => 1 }
}

sub read_one_char {

   # Read a next character, waiting until it is there.
   # >>>>
   my ($ch, $key) = Curses::getchar();
   # <<<<

   if (defined $key) {
      # it's a function key
      printw "Function key pressed: $key"; 
      printw "\n";
      # we are done if "backspace" was hit
      return { done => ($key == KEY_BACKSPACE()) }
   }
   elsif (!defined $ch) {
      # it's an error
      printw "getchar() failed\n";
      return {}
   }
   else {
      output_with_fix($ch);
      return { ch => $ch }
   }
}

sub output_with_fix {
   my($str) = @_;
   print STDERR "Received new character\n";
   print STDERR " About: " . about($str) . "\n";

   my $curses_str = $str; # will go to curses via curses printw()
   my $stderr_str = $str; # will go to stderr via print()

   my $fixit  = defined($ARGV[0]) && (lc($ARGV[0]) eq "--fixit");
   my $fakeit = defined($ARGV[0]) && (lc($ARGV[0]) eq "--fakeit");

   if ($fixit || $fakeit) {

      my $byte_len;
      my $byte_ord;
      {
         # https://perldoc.perl.org/bytes.html 
         use bytes; 
         $byte_len = length($str);
         $byte_ord = ord($str);
      }

      if ($byte_ord >= hex('0x80') && $byte_len == 2) {
         print STDERR " UTF-8 (on 2 bytes)? NOT Fixing it!\n";
      }
      elsif ($byte_ord >= hex('0x80') && $byte_len == 1) {
         print STDERR " ISO-8859-1? Fixing it!\n";
         if ($fixit) { 
            # Try to "decode it as ISO-8859-1" and hope that Perl fixes whatever is wrong
            # The result is a nice string internally encoded as UTF-8, with its UTF-8 flag properly set
            $curses_str = decode('iso-8859-1',$str);
         }
         else {
            # Fake it to show the above is not a NOP
            $curses_str = $str;
         }
         print STDERR " Fixed curses string to: " . about($curses_str) . "\n"; 
      }
      elsif ($byte_ord < hex('0x80') && $byte_len == 1) {
         print STDERR " ASCII: Nothing to fix\n";
      }
      else {
         die "Something weird!"
      }
   }

   print STDERR " Interpolated string  : $stderr_str\n";
   print STDERR " Concatenated string  : " . $stderr_str . "\n";
   print STDERR " Sprintf-built string : " . sprintf("%s",$stderr_str) . "\n";

   printw("...printw printing interpolated string  : $curses_str\n");
   printw("...printw printing concatenated string  : " . $curses_str . "\n");
   printw("...printw printing sprintf-built string : " . sprintf("%s",$curses_str) . "\n");

}

sub feedback {
   my ($ch) = @_;
   printw "The pressed key is: ";
   attron(A_BOLD);
   # 
   printw($ch);
   #
   attroff(A_BOLD);
   printw("\n");
   return { refresh => 1 }  # should refresh
}

# ---
# main
# ---

sub do_curses_run {
   setup;
   printw "Locale: $loc\n"; # printing OK
   my $done = 0;
   while (!$done) {
      my $bubl = announce(); 
      refresh() if $$bubl{refresh};
      $bubl = read_one_char();
      $done = $$bubl{done};
      if (!$done && defined $$bubl{ch}) {
         $bubl = feedback($$bubl{ch}); 
         refresh() if $$bubl{refresh};
      }
   }
   teardown;
}

do_curses_run();

# ===
# ANNEX
# ===

# ---
# annex stuff: printing info about string
# ---

sub about {
   my($str) = @_;

   my $char_len = length($str);
   my $ord      = ord($str);
   my $mark     = (utf8::is_utf8($str) ? "yes" : "no");

   my $byte_ord;
   my $byte_len;
   my $byte_contents;
   {
      # https://perldoc.perl.org/bytes.html
      use bytes;
      $byte_len = length($str);
      $byte_ord = ord($str); 
      $byte_contents = sprintf("%vx", $str);
   }

   my $f1 = sprintf("%d bytes", $byte_len);
   my $f2 = sprintf(", %d chars", $char_len);
   my $f3 = sprintf(", utf-8: %s",$mark);
   my $f4 = sprintf(", ord: x%x", $ord);
   my $f4p = sprintf(", byte-ord: x%x", $byte_ord);
   my $f5 = sprintf(", byte-contents (hex, via %%vx): %s", $byte_contents);
   return $f1 . $f2 . $f3 . $f4 . $f4p . $f5;
}

Output for normal operations

Run as perl test.pl 2> OUT

On the terminal, è is not displayed:

Locale: en_GB.UTF-8
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string  : p
...printw printing concatenated string  : p
...printw printing sprintf-built string : p
The pressed key is: p
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string  :
...printw printing concatenated string  :
...printw printing sprintf-built string :
The pressed key is:
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string  : д
...printw printing concatenated string  : д
...printw printing sprintf-built string : д
The pressed key is: д
Type any character to see it in bold! (or backspace to exit)

In OUT:

The lines marked with # are coming from the C-side of curses shim, where I added some printfs:

# Obtained win_t 0x0070
# Not UTF-8 string: 70 00
Received new character
 About: 1 bytes, 1 chars, utf-8: no, ord: x70, byte-ord: x70, byte-contents (hex, via %vx): 70
 Interpolated string  : p
 Concatenated string  : p
 Sprintf-built string : p
# Obtained win_t 0x00e8
# Not UTF-8 string: e8 00
Received new character
 About: 1 bytes, 1 chars, utf-8: no, ord: xe8, byte-ord: xe8, byte-contents (hex, via %vx): e8
 Interpolated string  : è
 Concatenated string  : è
 Sprintf-built string : è
# Obtained win_t 0x0434
# UTF-8 string: d0 b4 00
Received new character
 About: 2 bytes, 1 chars, utf-8: yes, ord: x434, byte-ord: xd0, byte-contents (hex, via %vx): d0.b4
 Interpolated string  : д
 Concatenated string  : д
 Sprintf-built string : д
# Obtained win_t 0x0107

Output for "fixit" operations

Run as perl test.pl --fixit 2> OUT

On the terminal, è is now displayed.

Locale: en_GB.UTF-8
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string  : p
...printw printing concatenated string  : p
...printw printing sprintf-built string : p
The pressed key is: p
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string  : è
...printw printing concatenated string  : è
...printw printing sprintf-built string : è
The pressed key is:
Type any character to see it in bold! (or backspace to exit)
...printw printing interpolated string  : д
...printw printing concatenated string  : д
...printw printing sprintf-built string : д
The pressed key is: д
Type any character to see it in bold! (or backspace to exit)

In OUT:

The lines marked with # are coming from the C-side of curses shim, where I added some printfs:

# Obtained win_t 0x0070
# Not UTF-8 string: 70 00
Received new character
 About: 1 bytes, 1 chars, utf-8: no, ord: x70, byte-ord: x70, byte-contents (hex, via %vx): 70
 ASCII: Nothing to fix
 Interpolated string  : p
 Concatenated string  : p
 Sprintf-built string : p
# Obtained win_t 0x00e8
# Not UTF-8 string: e8 00
Received new character
 About: 1 bytes, 1 chars, utf-8: no, ord: xe8, byte-ord: xe8, byte-contents (hex, via %vx): e8
 ISO-8859-1? Fixing it!
 Fixed curses string to: 2 bytes, 1 chars, utf-8: yes, ord: xe8, byte-ord: xc3, byte-contents (hex, via %vx): c3.a8
 Interpolated string  : è
 Concatenated string  : è
 Sprintf-built string : è
# Obtained win_t 0x0434
# UTF-8 string: d0 b4 00
Received new character
 About: 2 bytes, 1 chars, utf-8: yes, ord: x434, byte-ord: xd0, byte-contents (hex, via %vx): d0.b4
 UTF-8 (on 2 bytes)? NOT Fixing it!
 Interpolated string  : д
 Concatenated string  : д
 Sprintf-built string : д
# Obtained win_t 0x0107

Addendum

Of course, you need libcursesw.

If you have the Perl process pid, try:

pmap -p PID

which shows the per process having attached:

/usr/lib64/libncursesw.so.6.1
/usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so

Addendum 2

Tried to see what the C code does and added an fprintf directly into the multibyte handling code of curses/Curses-1.36/CursesFunWide.c, recompiled, didn't manage to override the system Curses.so with my own via LD_LIBRARY_PATH (why not? why is everything only working half of the time?), so replaced the system library directly in place (take THAT!).

#ifdef C_GET_WCH
    wint_t wch;
    int ret = wget_wch(win, &wch);
    if (ret == OK) {
        ST(0) = sv_newmortal();
        fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
        c_wchar2sv(ST(0), wch);
        XSRETURN(1);
    } else if (ret == KEY_CODE_YES) {
        XST_mUNDEF(0);
        ST(1) = sv_newmortal();
        sv_setiv(ST(1), (IV)wch);
        XSRETURN(2);
    } else {
        XSRETURN_UNDEF;
    }
#else

Addendum 3

The win_t (apparently the same as wchar_t) conversion code from CursesWide.c, converts the wint_t (here seen as wchar_t) received from wget_wch() into a Perl string. SV is the "scalar value" type.

See also: https://perldoc.perl.org/perlguts.html

Here with two fprintf inserted to see what is going on:

static void
c_wchar2sv(SV *    const sv,
           wchar_t const wc) {
/*----------------------------------------------------------------------------
  Set SV to a one-character (not -byte!) Perl string holding a given wide
  character
-----------------------------------------------------------------------------*/
    if (wc <= 0xff) {
        char s[] = { wc, 0 };
        fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_off(sv);
    } else {
        char s[UTF8_MAXBYTES + 1] = { 0 };
        char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
        *s_end = 0;
        fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_on(sv);
    }
}

Re "*Contrary to what perldoc say, use bytes does not only apply to its scope. It break out of it and changes behaviour of ord program-wide.*", [That's not true](https://pastebin.com/uUF4SvjM) — ikegami, Apr 01 '20 at 22:39
What part of this question isn't answered by [my earlier answer](https://stackoverflow.com/a/60943285/589924)? (I'm not being sarcastic) — ikegami, Apr 01 '20 at 22:40
@ikegami You are right, it's not true. What the hell? Oh well, another occasion to clean up the code. — David Tonhofer, Apr 02 '20 at 02:18
@ikegami Actually not sure, it helped greatly and I worked through it all again to whittle down the problem to failure of printing of characters 0x80 to 0xFF, and I strongly suspect anything on 2-byte UTF being printed correctly entirely by accident. I should should open a bug report on CPAN at that point. — David Tonhofer, Apr 02 '20 at 02:31
@ikegami ...plus the old question would have been totally mangled with the clarifications. And become another question. — David Tonhofer, Apr 02 '20 at 02:41
I showed 80..FF being printed properly with the modified `printw`. Could you provide a minimal, runnable demonstarion of the problem — ikegami, Apr 02 '20 at 02:44
But more minimal is not possible. You need curses. You need to initialize it. You need to print the character. You need to print info about the character. It's not an encoding issue of a string that is restricted to Perl only. — David Tonhofer, Apr 02 '20 at 02:51
At least clearly identify inputs required to reproduce the problem. Plus, clearly identify the desired output and expected output. I'm not spending hours on this again when I already provided working code. — ikegami, Apr 02 '20 at 03:04
@ikegami What do you mean "at least". Seriously, everything is up there. EVERYTHING. — David Tonhofer, Apr 02 '20 at 08:34
Again, that's the problem. There's way to much. You have failed to *clearly* or *minimally* demonstrate the problem. Those are my terms for looking into this further. Your call. — ikegami, Apr 02 '20 at 08:35
@ikegami Look, this is not constructive. I appreciate your help, but I don't see how to go any more minimal than this. Because Perl code alone won't cut it. You have to see the output of curses. — David Tonhofer, Apr 02 '20 at 08:40
...and you appear to be using `printw`? We [already established](https://stackoverflow.com/a/60943285/589924) `printw` is broken and can't be used. A fix was provided. No point in looking at this. (If I did, I'd be willing to bet I'd discover this is just a duplicate of your [earlier question](https://stackoverflow.com/q/60902949/589924).) — ikegami, Apr 02 '20 at 08:41
And you're still insisting on using `use bytes;` and `decode('iso-8859-1',$str)`. X_X — ikegami, Apr 02 '20 at 08:48
It seems the problem you are asking about is that `è` isn't appearing properly when using `printw`? [Asked](https://stackoverflow.com/q/60902949/589924) and [answered](https://stackoverflow.com/a/60943285/589924). So much time wasted :( — ikegami, Apr 02 '20 at 09:04
As for your claims that your post is already a minimal demonstration, [this](https://pastebin.com/mxJPDkgQ) shows otherwise. — ikegami, Apr 02 '20 at 09:16
This is a great study, but I can't go through it properly now -- just one question that strikes me at first: Why do you conclude that there are stray NULs in there? I see that you show them in C, but aren't those C-string endings? To me it seems that ikegami's answer to the previous question nails it, and it's what I had found and stated in comments there, that `get_char` was fine but `printw` wasn't. An undue zero byte would of course upend that but I haven't seen it in Perl? — zdim, Apr 03 '20 at 07:24
_It seems the problem you are asking about is that è isn't appearing properly when using printw? Asked and answered. So much time wasted :( _ Gee, sorry about that. For some reason, I didn't see that answer. Why so stressed, do you have a bus to take? — David Tonhofer, Apr 04 '20 at 22:42