nroff/groff does not properly convert utf-8 encoded file

Question

I have a UTF-8 encoded roff file that I want to convert to a man page with

$ nroff -mandoc inittab.5

However, characters such as [äöüÄÖÜ] are not displayed properly; it seems that nroff assumes ISO 8859-1 encoding (I am getting [Ã¤Ã¶Ã¼ÃÃÃ] instead). Calling nroff with the -Tutf8 flag does not change the behaviour, and the locale environment variables are (I assume properly) set to

LANG=de_DE.utf8
LC_CTYPE="de_DE.utf8"
LC_NUMERIC="de_DE.utf8"
LC_TIME="de_DE.utf8"
LC_COLLATE="de_DE.utf8"
LC_MONETARY="de_DE.utf8"
LC_MESSAGES="de_DE.utf8"
LC_PAPER="de_DE.utf8"
LC_NAME="de_DE.utf8"
LC_ADDRESS="de_DE.utf8"
LC_TELEPHONE="de_DE.utf8"
LC_MEASUREMENT="de_DE.utf8"
LC_IDENTIFICATION="de_DE.utf8"
LC_ALL=

Since nroff is only a wrapper script that eventually calls groff, I checked the call to the latter, which is:

$ groff -Tutf8 -mandoc inittab.5

Comparing the byte encodings of characters in the source file and the output file, I get the following conversions:

character  src file  output file
---------  --------  -----------
ä          C3 A4     C3 83 C2 A4
ö          C3 B6     C3 83 C2 B6
ü          C3 BC     C3 83 C2 BC
Ä          C3 84     C3 83
Ö          C3 96     C3 83
Ü          C3 9C     C3 83
ß          C3 9F     C3 83

This behaviour seems very weird to me: why am I getting an additional C3 83, and why is the original byte sequence truncated altogether for uppercase umlauts and ß?

Why is this and how can I make nroff/groff properly convert my utf-8 encoded file?

EDIT: I am using GNU nroff (groff) version 1.22.2

When you run say `less inittab.5` do you see proper characters? By the way the question is off topic for this site, you may have better luck at unix/linux stackexchange. — n. m. could be an AI, Oct 10 '18 at 05:21
Evidently nroff thinks its *input* is Latin-1 and tries to transcode it to UTF-8. Try running with -Tlatin1 to avoid transcoding. — n. m. could be an AI, Oct 10 '18 at 05:34
It looks like groff doesn't support UTF-8 input at all. https://www.gnu.org/software/groff/manual/html_node/Input-Encodings.html — n. m. could be an AI, Oct 10 '18 at 05:39
OK, that makes sense. How come most of my Gentoo programs come with UTF-8 encoded man pages then? I could convert them to Latin-1, but that would omit other characters. Are you aware of an nroff alternative that supports UTF-8 input? — Simon Fromme, Oct 10 '18 at 05:54

score 12 · Accepted Answer · answered Dec 05 '18 at 06:24

Unlike other troff implementations (namely Plan 9 and Heirloom troff), groff does not accept UTF-8 directly in documents. However, UTF-8 input can be handled with the preconv(1) preprocessor, which converts the UTF-8 characters in a file to groff's native escape sequences.
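The byte table in the question is consistent with groff reading each input byte as one Latin-1 character and then writing that character out as UTF-8. The following Python sketch is not groff code; it only reproduces the arithmetic. The function names are made up for illustration, and the second function rests on the assumption that groff silently discards input bytes in the range 0x80 to 0x9F as invalid, which would account for the truncated output for Ä, Ö, Ü and ß.

```python
def latin1_misread(utf8_bytes: bytes) -> bytes:
    """Interpret every byte as a Latin-1 character, then encode as UTF-8."""
    return utf8_bytes.decode("latin-1").encode("utf-8")


def groff_like(utf8_bytes: bytes) -> bytes:
    """Same as above, but first drop bytes 0x80-0x9F.

    ASSUMPTION: this mimics groff ignoring those bytes as invalid input.
    """
    kept = bytes(b for b in utf8_bytes if not 0x80 <= b <= 0x9F)
    return latin1_misread(kept)


for ch in "äöüÄÖÜß":
    src = ch.encode("utf-8")
    out = groff_like(src)
    print(ch, src.hex(" ").upper(), "->", out.hex(" ").upper())
```

For ä the source bytes C3 A4 become C3 83 C2 A4, because C3 is re-encoded as C3 83 and A4 as C2 A4. For Ä the second byte 84 falls in the discarded range, so only C3 83 remains, matching the table.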

Take for example this groff_ms(7) document:

.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the café down the street

äöüÄÖÜ

Using groff normally, we get:

                StackOverflow Test Document


                        ToasterKing


     I like going to the cafÃ© down the street

Ã¤Ã¶Ã¼ÃÃÃ

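But when piping the file through preconv first (for example, preconv so.ms | groff -ms -Tutf8) or letting groff run preconv itself with the -k option (for example, groff -k -ms -Tutf8 so.ms), we get: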

                StackOverflow Test Document


                        ToasterKing


     I like going to the café down the street

äöüÄÖÜ

Looking at the output of preconv, you can see how it transforms characters into escape sequences:

.lf 1 so.ms
.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the caf\[u00E9] down the street

\[u00E4]\[u00F6]\[u00FC]\[u00C4]\[u00D6]\[u00DC]
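The transformation shown above can be imitated in a few lines of Python. This is only a simplified sketch of the idea, not how preconv is implemented: the function name is hypothetical, the input is assumed to be already decoded UTF-8 text, and the real tool additionally detects the input encoding and emits the .lf line seen in its output.

```python
def to_groff_escapes(text: str) -> str:
    """Replace every non-ASCII character with a \\[uXXXX] escape."""
    return "".join(
        # ASCII passes through unchanged; everything else becomes the
        # code point in upper-case hex, padded to at least four digits.
        ch if ord(ch) < 128 else f"\\[u{ord(ch):04X}]"
        for ch in text
    )


print(to_groff_escapes("I like going to the café down the street"))
# I like going to the caf\[u00E9] down the street
print(to_groff_escapes("äöüÄÖÜ"))
# \[u00E4]\[u00F6]\[u00FC]\[u00C4]\[u00D6]\[u00DC]
```

Because the result is pure ASCII, groff never sees a byte above 0x7F and the Latin-1 misreading cannot occur.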
