How does ncurses output non-ascii characters?

Question

I'd like to know how ncurses (a C library) manages to output characters like ├, despite them not (to the best of my knowledge) being part of ASCII.

I would have assumed it was just drawing them pixel by pixel, but you can copy/paste them out of the terminal (on macOS).

Although there is no native support for UTF-8 in the C language, you can still write C programs that read, manipulate, and output arbitrary bytes, usually represented with an `unsigned char` or `uint8_t` type. — David Grayson, Apr 28 '17 at 04:01

score 3 · Answer 1 · answered Apr 30 '17 at 17:10

ncurses puts characters such as ├ on the screen by assuming that your locale environment variables (LC_ALL and/or LC_CTYPE) match the terminal on which you are displaying. The environment variables indicate the encoding (e.g., UTF-8). There are other encodings and terminals which support those encodings, but generally speaking you'll mostly see UTF-8. If the environment and terminal cooperate, things "just work":

at startup, ncurses checks for the locale which a program has initialized, via setlocale, and determines if that uses UTF-8. It uses that information later.
when a program adds character strings, e.g., using addstr, ncurses uses the character-type information (set as a side-effect of calling setlocale), and uses standard C library functions for combining sequences of bytes which make up a multi-byte character, and converting those into wide characters. It stores those wide characters internally, and
when writing to the terminal, ncurses reverses the process, converting from wide characters to use the encoding assumed to be supported by the terminal (assuming that your locale environment matches the terminal).
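The round trip described in these steps (bytes to wide characters and back) can be sketched with the standard C functions `mbrtowc` and `wcrtomb`. This is only an illustration of the general mechanism, not ncurses's own source; the helper names `first_wide_char` and `wide_to_bytes` are made up for this example, and the sketch assumes the caller has already selected a UTF-8 locale with `setlocale`.

```c
#include <locale.h>   /* setlocale(), called by the program before these helpers */
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the first multibyte character of s using the
 * encoding of the current LC_CTYPE locale. Returns the number of bytes
 * consumed (0 for an empty string), or -1 if the bytes are not a valid
 * character in that encoding. */
int first_wide_char(const char *s, wchar_t *out)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial conversion state */
    size_t n = mbrtowc(out, s, strlen(s) + 1, &state);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;
    return (int)n;
}

/* Hypothetical helper: the reverse step, from one wide character back to
 * bytes in the locale's encoding. buf must hold at least MB_CUR_MAX bytes.
 * Returns the number of bytes written, or -1 on failure. */
int wide_to_bytes(wchar_t wc, char *buf)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t n = wcrtomb(buf, wc, &state);
    return n == (size_t)-1 ? -1 : (int)n;
}
```

After `setlocale(LC_ALL, "")` in a UTF-8 environment, `first_wide_char("\xE2\x94\x9C", &wc)` would be expected to consume the three bytes of ├; on systems where `wchar_t` holds Unicode code points, `wc` is then 0x251C.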

However —

The character in question, ├, happens to be a special case. It is one of the graphic characters used for line-drawing, which predate Unicode and UTF-8. curses has names for these graphic characters, making it simple to refer to them, e.g., ACS_LTEE (the ├ is a left-tee):

Before UTF-8 came along to complicate things, developers came up with a scheme using a table of these graphic characters by adapting the escape sequences used for the VT100 (late 1970s) and the AT&T 4410 and 5410 terminals (apparently the early 1980s since the latter were in use by 1984) for drawing their graphic characters.
AT&T SystemV curses provided support for these graphic characters from the mid-1980s. BSD curses never did that...
Unicode (roughly 1990 and later) provided most of the same glyphs using a different encoding. There are a few omissions (the most noticeable are the scan lines above/below the one used for horizontal lines), but once UTF-8 got into use in the early 2000s, it was logical to extend ncurses to use these characters.
ncurses looks at the locale settings, but prefers using the terminal description for these graphic characters except for cases where that is known to not work — and will assume that the terminal can display the Unicode equivalents for these characters if the terminal is assumed to use UTF-8. It uses a table for this purpose (SystemV curses and its successor X/Open Curses didn't do any of this — NetBSD curses adapted the table from ncurses sometime after 2010).
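The idea of a lookup table from line-drawing characters to Unicode can be sketched as follows. This is a small illustrative subset, not the table ncurses actually ships: the struct, the array, and the function `acs_to_unicode` are hypothetical names for this example. The keys are the VT100 alternate-character-set letters behind the corresponding `ACS_` names, and the snowman entry for ACS_LANTERN reflects the U+2603 assignment discussed in the comments below.

```c
#include <stddef.h>

/* Illustrative subset only: VT100 alternate-character-set keys paired with
 * Unicode code points that draw the same glyph. */
struct acs_entry {
    char key;            /* VT100 line-drawing key */
    unsigned int ucs;    /* Unicode code point */
    const char *name;    /* curses name for the glyph */
};

static const struct acs_entry acs_table[] = {
    { 'l', 0x250C, "ACS_ULCORNER" },  /* upper-left corner  */
    { 'm', 0x2514, "ACS_LLCORNER" },  /* lower-left corner  */
    { 'k', 0x2510, "ACS_URCORNER" },  /* upper-right corner */
    { 'j', 0x2518, "ACS_LRCORNER" },  /* lower-right corner */
    { 't', 0x251C, "ACS_LTEE"     },  /* left tee           */
    { 'u', 0x2524, "ACS_RTEE"     },  /* right tee          */
    { 'v', 0x2534, "ACS_BTEE"     },  /* bottom tee         */
    { 'w', 0x252C, "ACS_TTEE"     },  /* top tee            */
    { 'q', 0x2500, "ACS_HLINE"    },  /* horizontal line    */
    { 'x', 0x2502, "ACS_VLINE"    },  /* vertical line      */
    { 'n', 0x253C, "ACS_PLUS"     },  /* crossing lines     */
    { 'i', 0x2603, "ACS_LANTERN"  },  /* lantern (snowman)  */
};

/* Hypothetical lookup: returns the Unicode code point for a VT100
 * line-drawing key, or 0 if the key is not in this subset. */
unsigned int acs_to_unicode(char key)
{
    for (size_t i = 0; i < sizeof acs_table / sizeof acs_table[0]; i++) {
        if (acs_table[i].key == key)
            return acs_table[i].ucs;
    }
    return 0;
}
```

In a curses program none of this is visible: the glyph is requested by name, for example `addch(ACS_LTEE)`, and the library decides whether to send the terminal's line-drawing sequence or the Unicode equivalent.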

Further reading:

NCURSES_NO_UTF8_ACS
Line Graphics (in curs_addch(3x))
Line Graphics (in curs_add_wch(3x))

Sorry, might I ask a very random off-topic question: how did ACS_LANTERN become a _snowman?_ The only source I've found is Dan Gookin's Programmer's Guide to NCurses, which suggests it would be a paragraph mark. — Lorinczy Zsigmond, May 04 '17 at 11:06
It was the closest match in Linux console fonts in 1997 - a [storm lantern](https://www.google.com/search?q=storm+lantern&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwio99Wis9fTAhXFbSYKHXmiAhgQsAQIxQE) (Unicode had nothing better than that in 2002 when the built-in table was added). — Thomas Dickey, May 04 '17 at 23:35
Thank you for your answer; are you referring to some CP437 character? Some of them, indeed, might be associated with a storm lantern, eg 15 (U+A7), CE (U+256C), E8 (U+3A6), E9 (U+398), EB (U+3B4); also in CP850 there is CF (U+A4) — Lorinczy Zsigmond, May 05 '17 at 08:17
yes: the description in 1997 used **`CE`** (the closest to a "lamp", arguably). If there were a closer match in Unicode in 2002, that would have been used... — Thomas Dickey, May 05 '17 at 11:18
Thanks again. Perhaps then it wouldn't be too late to change ACS_LANTERN from U+2603 to U+256C. (Or maybe U+A7, if you believe the paragraph-theory.) — Lorinczy Zsigmond, May 05 '17 at 12:40
ncurses has a more useful assignment for [U+256C - see manpage](http://invisible-island.net/ncurses/man/curs_add_wch.3x.html). I'll take a look at the paragraph comment (though "lantern" is what's documented). — Thomas Dickey, May 05 '17 at 13:13

score 0 · Answer 2 · Davislor · 2017-04-28T17:52:16.760

There is more than one version of ncurses, for more than one platform, and if you really want to know, check the source. However, none of them would draw a character pixel-by-pixel; that isn’t something a library running inside a terminal emulator does.

Modern versions of the C standard library, POSIX and ncurses all support writing wide characters to the console and conversion between wide and multibyte strings. Today, wide characters are normally UTF-16 or UTF-32 and multibyte strings are normally UTF-8. You can see the documentation for <wchar.h> and ncursesw for more information.

Note that C11 does have support for UTF-8 literals, through the u8 prefix.
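As a minimal sketch of the `u8` prefix, the function below copies the UTF-8 encoding of U+251C (├) out of a `u8` literal. The function name `left_tee_utf8` is hypothetical. The bytes are copied with `memcpy` rather than read through a `char` pointer because the element type of `u8` literals differs between C11 and later revisions of the standard.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the UTF-8 bytes of U+251C into out.
 * Returns the number of bytes copied, or 0 if cap is too small. */
size_t left_tee_utf8(unsigned char *out, size_t cap)
{
    /* sizeof counts the terminating NUL of the literal, so subtract one. */
    size_t len = sizeof(u8"\u251C") - 1;
    if (len > cap)
        return 0;
    memcpy(out, u8"\u251C", len);
    return len;
}
```

The literal is encoded as UTF-8 regardless of the compiler's execution character set, so the result is the three bytes 0xE2 0x94 0x9C.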

A program that’s concerned about portability with systems where the local multibyte encoding is something other than UTF-8 can use another library such as the C++ standard library or ICU to convert between UTF-8 and wide-character strings, then display those with curses.

You might need to `#define _XOPEN_SOURCE 700`, or the appropriate value for the version of the standard you are targeting, and with some versions of the libraries, also `#define _XOPEN_SOURCE_EXTENDED 1`, to get your system libraries to let you use functions such as `addwstr()`.

However, many programs might simply send strings of char encoded in UTF-8 to the console and assume it can handle them. I don’t recommend this approach, but it works on most Linux systems in 2017.
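The "just send the bytes" approach needs nothing beyond standard I/O, as this sketch shows. The function name `put_left_tee` is made up for the example, and the code assumes that whatever reads the stream (normally the terminal) interprets it as UTF-8.

```c
#include <stdio.h>

/* Hypothetical helper: write the raw UTF-8 bytes of U+251C to a stream.
 * Returns the number of bytes written. Nothing here checks the locale or
 * the terminal; the receiving end is simply assumed to understand UTF-8. */
size_t put_left_tee(FILE *out)
{
    static const char tee[] = "\xE2\x94\x9C";
    return fwrite(tee, 1, sizeof tee - 1, out);
}
```

Calling `put_left_tee(stdout)` on a UTF-8 terminal shows ├. The terminal receives a character code rather than pixels, which is why the glyph can be copied and pasted.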

edited Apr 28 '17 at 17:52

answered Apr 28 '17 at 12:56

Davislor


There's more than one implementation of **curses**, but **ncurses** is a specific implementation (only one). – Thomas Dickey Apr 28 '17 at 16:48
@ThomasDickey There’s a MingW port also called ncurses. In any case, the way to see what it actually does is to download the source code and read it. – Davislor Apr 28 '17 at 16:49
@ThomasDickey There must be some differences, or it wouldn’t have a separate source tree. However, I’ve reworded the first sentence to clarify. – Davislor Apr 28 '17 at 17:54
packagers maintain source-trees which are copies of other source-trees (without a pointer to the specific one you have in mind, there's no way to point out how it is derived and used). – Thomas Dickey Apr 28 '17 at 19:20
@ThomasDickey In any case, I reworded that sentence to sidestep this matter of terminology. – Davislor Apr 28 '17 at 19:22
