
In this output, why am I getting extra newlines after printing non-ASCII Unicode characters?

The platform is Windows Vista, and the problem occurs after `chcp 65001` but not after `chcp 850`:

C:\>chcp 850
Active code page: 850

C:\>perl unicode_bug_1.pl
Budweiser
Budweiser
Budweiser
Bud─øjovick├¢ Budvar
Bud─øjovick├¢ Budvar
Bud─øjovick├¢ Budvar

C:\>chcp 65001
Active code page: 65001

C:\>perl unicode_bug_1.pl
Budweiser
Budweiser
Budweiser
Budějovický Budvar

Budějovický Budvar

Budějovický Budvar

The output comes from this program:

#!perl
use strict;
use warnings;

binmode (STDOUT, "encoding(UTF-8)"); # so no "Wide character in print" warning

print "Budweiser\n" for 1..3;
print "Bud\N{U+011B}jovick\N{U+00FD} Budvar\n" for 1..3;
hippietrail
RedGrittyBrick
  • No idea; not happening for me. Can you tell us anything about the environment where you are running this? – ysth Dec 31 '10 at 21:51

2 Answers

This seems to be a bug in Perl. I had thought it was a bug in Windows, with code page 65001 not really being supported for the console, but I finally wrote test programs in C and Perl, and the problem does not happen in the C version. It happens no matter where the non-ASCII character occurs in the line, but the line you're printing must be wider than the console supports.

Here is my C program:

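#include "stdafx.h"   // Visual Studio precompiled header

#include <stdio.h>    // printf
#include <tchar.h>    // _tmain, _TCHAR
#include <windows.h>  // SetConsoleOutputCP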


int _tmain(int argc, _TCHAR* argv[])
{
    BOOL b = SetConsoleOutputCP(65001);
    printf("set console output codepage returned %d\n", b);

    printf("cαfe\n");
    printf("1234567890 café\n");
    printf("1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n");
    printf("1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n");

    return 0;
}

And here is my Perl program:

#

use strict;
use warnings;
use utf8;                  # the source file itself is UTF-8 encoded

binmode STDOUT, ':utf8';   # encode output as UTF-8

printf STDOUT "cαfe\n";
printf STDOUT "1234567890 café\n";
printf STDOUT "1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 café\n";
printf STDOUT "1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 1234567890 cαfe\n";

UPDATE

No, I was wrong. With the help of some of the guys at #perl on irc.perl.org, it turns out to be a bug in the Microsoft API: WriteFile is documented to report the number of bytes written, but here it reports the number of characters written, which depends on the code page. A bug was filed in March 2010.

There is more discussion in the MSDN forums.

UPDATE 2

I posted about this problem on Michael Kaplan's blog, "Sorting it all out", and he responded with an article entitled "Hidden in plain site: a purloined letter kind of a bug report". He's a Microsoft internationalization expert, so you will surely find some insights there.

hippietrail

I'm not getting any newlines. Is your command line wide enough to fit your output?

Hugmeir
  • My command line is wide enough, but I have noticed that the problem doesn't happen if I set the code page to 850 using `chcp 850`. However, then the characters don't all display properly. Windows Vista 32-bit, Activestate Perl 5.10.0 MSWin32-x86-multi-thread. – RedGrittyBrick Dec 31 '10 at 22:07
  • chcp output here: 932. Try that, maybe? – Hugmeir Dec 31 '10 at 22:32
  • @RedGrittyBrick, I don't see the described issue on Windows Vista 64-bit, Activestate Perl 5.10.1 MSWin32-x86-multi-thread. Maybe try upgrading your Perl install. – Ven'Tatsu Jan 01 '11 at 01:10
  • @Ven'Tatsu: I upgraded to Activestate latest version (5.12.2) - same problem. – RedGrittyBrick Jan 01 '11 at 17:53
  • @Hugmeir: Windows says "Invalid code page" when I type `chcp 932`. – RedGrittyBrick Jan 01 '11 at 17:55
  • @RedGrittyBrick: I think that's Japanese. Anyway, I'm not getting any newlines on XP using chcp 850, 65001, or 932, although none of those display the Unicode characters properly. I can't get on a machine with Vista until Monday. Could I interest you in trying Windows Powershell, if you need the Unicode output in the console? For whatever reason, it seems to be understanding the characters properly but screwing up the display; try redirecting the output to a file (à la `perl script.pl > file.txt`), and the file will get the correct output. – Hugmeir Jan 01 '11 at 19:19
  • @Hugmeir, you are right, redirecting to a file proves that Perl is outputting the correct sequence, so it must be some quirk in the Vista command prompt. – RedGrittyBrick Jan 03 '11 at 22:35