Here is one thing I can't get my head around:

I am using Windows 7 and Strawberry Perl 5.20, and I want to write UTF-8 to the console (cmd.exe) with chcp 65001.

The UTF-8 characters themselves come out fine, even those above 255, but some characters are mysteriously duplicated. This only happens if I don't redirect the output into a file.

EDIT: I have since found another post describing essentially the same problem, last-octet-repeated-when-my-perl-program-outputs-a-utf-8. The solution is to add binmode(STDOUT, 'unix:encoding(utf8):crlf') to the Perl program. I have tested it and it works fine now.

Thanks to everybody who looked into this weird problem.

In a nutshell, I am writing the UTF-8 string (chr(300) x 3).chr(301)."UVW\x{0D}\x{0A}". When I redirect the output into a flat file and then print the flat file, everything is fine.

However, when I print directly to the console, some characters are mysteriously duplicated (I am talking about the characters "VW" on the separate line), and I don't know why.

Here is my test output:

Page de codes active : 65001

Redirected into a file:
-----------------------
ĬĬĬĭUVW

Printed directly:
-----------------
ĬĬĬĭUVW
VW

IO-Layers = (unix crlf)

C4ACC4ACC4ACC4AD5556570D0A
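The hex dump above can be cross-checked without involving the console at all. The following is a minimal sketch, assuming the core Encode module is available; it encodes the same string in memory and prints its character count, byte count and hex dump. The variable names are illustrative and not part of the original test program.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same text the test program produces once "\n" has become CRLF.
my $text  = (chr(300) x 3) . chr(301) . "UVW\r\n";
my $bytes = encode('UTF-8', $text);

# Four of the characters need two bytes each in UTF-8,
# so the byte count is larger than the character count.
printf "%d characters, %d bytes\n", length $text, length $bytes;
print uc(unpack('H*', $bytes)), "\n";

# prints:
# 9 characters, 13 bytes
# C4ACC4ACC4ACC4AD5556570D0A
```

The string is 9 characters long but 13 bytes long, a difference of 4 bytes.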

Here is my test program:

@echo off
chcp 65001
echo.

set H1=BEGIN{binmode(*STDIN); undef $/;
set HEXDUMP="%H1% print uc(unpack('H*',<STDIN>)), qq{\n}}"

set L1=my @l = PerlIO::get_layers(*STDOUT, output, 1);
set LAYERS="%L1% print {*STDERR} qq{IO-Layers = (@l)\n};"

set PROG="print chr(300) x 3, chr(301), qq{UVW\n};";

set TFILE=%TEMP%\tfile.txt

echo Redirected into a file:
echo -----------------------
perl -C6 -e%PROG% >%TFILE% && type %TFILE%
echo.

echo Printed directly:
echo -----------------
perl -C6 -e%PROG%

echo.
perl -e%LAYERS%
echo.

perl -e%HEXDUMP% <%TFILE%

echo.
pause

As I said, the characters themselves are printed correctly, but why is there this mysterious duplication? And why *only* when the output is not redirected into a file?

user2288349
  • Why are you printing U+0300 and U+0301, which are "combining grave accent" and "combining acute accent" respectively? Those accents are meant to be applied to a preceding character. Also, Perl on a Windows system will translate `"\n"` to CRLF implicitly, and there is no need to print a U+000D yourself. – Borodin Aug 30 '14 at 18:17
  • Ah, I see. Your actual code sends *decimal* 300 and 301, which is U+012C and U+012D, which are "capital I breve" and "small I breve", which matches the output you say you are getting. Even so, why are you printing such strange stuff? – Borodin Aug 30 '14 at 18:20
  • I am printing strange stuff to test that my program is rock-solid as far as the output is concerned. – user2288349 Aug 30 '14 at 18:29
  • I have edited my text to reflect that my code sends decimal 300 and 301 – user2288349 Aug 30 '14 at 18:36
  • I have seen another post that had essentially the same problem at [last-octet-repeated-in-utf8](http://stackoverflow.com/questions/23416075/why-am-i-getting-the-last-octet-repeated-when-my-perl-program-outputs-a-utf-8-en?rq=1) -- the solution is to inject a binmode(STDOUT, 'unix:encoding(utf8):crlf') into the perl program -- I have tested and it works fine now – user2288349 Aug 30 '14 at 18:51
  • I'm afraid I can't help. I can only imagine that you're looking at a bug in cmd.exe. Note that if you use something less exotic than "I breve" then it works fine. Also, if you decompose those characters into a separate letter I and a combining breve accent `"\x{0061}\x{0306}"` then it doesn't combine them, but still repeats the trailing characters on the next line. – Borodin Aug 30 '14 at 18:55
  • @hobbs: That's the question that the OP linked. – Borodin Aug 30 '14 at 20:24
  • @Borodin *after* my comment, if only by a few minutes :) – hobbs Aug 30 '14 at 21:17
  • @hobbs: Two comments above yours. – Borodin Aug 30 '14 at 21:33
  • @user2288349: Rather than editing your question, you should post your solution as an answer to your own problem, and then accept it. It is more worthy of being accepted than my own response, which, until it gained votes, I was tempted to delete as being little more than just narrative. – Borodin Aug 31 '14 at 19:45

1 Answer

As I suspected, this has been reported as a failure in Windows software:

This is caused by a bug in Windows. When writing to a console set to code page 65001, WriteFile() returns the number of characters written instead of the number of bytes.
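To see how a wrong return value could produce exactly the duplication in the question, here is a small simulation. It is only a sketch under stated assumptions: `buggy_write` and `write_all` are hypothetical stand-ins, not actual Windows or PerlIO code, and the retry loop is simply the usual "keep writing until everything is reported as written" pattern.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my @console;    # stands in for what the console receives

# Hypothetical stand-in for the buggy call: it accepts every byte,
# but reports the number of characters instead of the number of bytes.
sub buggy_write {
    my ($bytes) = @_;
    push @console, $bytes;
    return length decode('UTF-8', $bytes);
}

# Generic retry loop: resend whatever was reported as not yet written.
sub write_all {
    my ($bytes) = @_;
    while (length $bytes) {
        my $reported = buggy_write($bytes);
        substr($bytes, 0, $reported) = '';    # drop the "written" part
    }
}

write_all(encode('UTF-8', (chr(300) x 3) . chr(301) . "UVW\r\n"));
print uc(unpack('H*', $_)), "\n" for @console;

# prints:
# C4ACC4ACC4ACC4AD5556570D0A
# 56570D0A
```

In this simulation the first call sends all 13 bytes but reports only 9, so the loop resends the last 4 bytes, which are "VW" followed by CRLF. That matches the extra "VW" line seen when printing directly to the console.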

I wasn't aware of a work-around, but if the :unix:encoding(utf8):crlf PerlIO stack works for you then it seems you have found one.
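For reference, a minimal sketch of that work-around as a standalone script, assuming it is run in a console already switched to code page 65001 with chcp. The layer string is the one quoted above; the rest is illustrative.

```perl
use strict;
use warnings;

# Push the layer stack from the question's edit onto STDOUT.
# With this in place the -C6 switch from the test program should not be needed.
binmode(STDOUT, ':unix:encoding(utf8):crlf')
    or die "binmode failed: $!";

# Same test string as in the question; the :crlf layer turns "\n" into CRLF.
print chr(300) x 3, chr(301), "UVW\n";
```

The original poster reports that this removed the duplicated characters on Windows 7 with Strawberry Perl 5.20; behaviour on other Windows or Perl versions is not covered here.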

Borodin