
I have an HTML file, test.html, created with Atom, which contains:

Testé encoding utf-8

When I read it in the PowerShell console (I'm using French Windows):

Get-Content -Raw test.html

I get back this:

TestÃ© encoding utf-8

Why is the accent character not printing correctly?

user310291

2 Answers

  • The Atom editor creates UTF-8 files without a pseudo-BOM by default (which is the right thing to do, from a cross-platform perspective).

  • Windows PowerShell[1] only recognizes UTF-8 files with a pseudo-BOM.

    • In the absence of the pseudo-BOM, PowerShell interprets files as being formatted according to the system's legacy ANSI code page, such as Windows-1252 on US-English or French systems.
      (This is also the default encoding used by Notepad, which it calls "ANSI", not just when reading files, but also when creating them. Ditto for Windows PowerShell's Get-Content / Set-Content (where this encoding is called Default and is the actual default and therefore needn't be specified); by contrast, Out-File / > creates UTF-16LE-encoded files (Unicode) by default.)
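The mis-decoding described above can be reproduced outside PowerShell. A minimal Python sketch (illustrative only) shows how é's UTF-8 bytes turn into Ã© when decoded as Windows-1252:

```python
text = "Testé encoding utf-8"

# 'é' is stored in UTF-8 as the two bytes 0xC3 0xA9.
raw = text.encode("utf-8")

# With no BOM to go by, Windows PowerShell decodes the file with the
# legacy ANSI code page (Windows-1252 on French and US systems),
# turning each of those bytes into a separate character: Ã and ©.
mojibake = raw.decode("cp1252")
print(mojibake)  # TestÃ© encoding utf-8

# Decoding with the correct encoding round-trips the text.
assert raw.decode("utf-8") == text
```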

Therefore, in order for Get-Content to read a BOM-less UTF-8 file correctly in Windows PowerShell, you must use -Encoding utf8:

Get-Content -Raw -Encoding utf8 test.html


[1] By contrast, the cross-platform PowerShell Core edition commendably defaults to UTF-8, consistently across cmdlets, both on reading and writing, so it does interpret UTF-8-encoded files correctly even without a BOM and by default also creates files without a BOM.

mklement0

# Created a UTF-8 Sig File 
notepad .\test.html

# Get file contents with and without -Raw
cat .\test.html;Get-Content -Raw .\test.html
Testé encoding utf-8
Testé encoding utf-8

# Check Encoding to make sure
Get-FileEncoding .\test.html
utf8

As you can see, it definitely works in PowerShell v5 on Windows 10. I'd double-check the formatting and contents of the file you created, as there may be characters in it that your editor doesn't display.

If you do not have Get-FileEncoding as a cmdlet in your PowerShell, here is an implementation you can run:

# Note: detection is purely BOM-based; a BOM-less file falls through to 'ascii'.
function Get-FileEncoding([Parameter(Mandatory=$True)]$Path) {
    $bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)

    if(!$bytes) { return 'utf8' }

    switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
        '^efbbbf'   {return 'utf8'}
        '^2b2f76'   {return 'utf7'}
        '^fffe'     {return 'unicode'}
        '^feff'     {return 'bigendianunicode'}
        '^0000feff' {return 'utf32'}
        default     {return 'ascii'}
    }
}
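For comparison, the BOM-sniffing logic above can be sketched in Python. Like the PowerShell function, it identifies an encoding only when a BOM is present, and a BOM-less file falls through to 'ascii'; the function name is mine:

```python
# BOM signatures checked against the first bytes of a file, mirroring
# the Get-FileEncoding function above. Note: like that function, it
# misreports a UTF-32 LE BOM (FF FE 00 00) as UTF-16 LE ('unicode').
BOMS = [
    (b"\xef\xbb\xbf",     "utf8"),
    (b"\x2b\x2f\x76",     "utf7"),
    (b"\xff\xfe",         "unicode"),           # UTF-16 LE
    (b"\xfe\xff",         "bigendianunicode"),  # UTF-16 BE
    (b"\x00\x00\xfe\xff", "utf32"),             # UTF-32 BE
]

def sniff_bom(first_bytes: bytes) -> str:
    for bom, name in BOMS:
        if first_bytes.startswith(bom):
            return name
    return "ascii"  # really "no BOM" -- the actual encoding is unknown

print(sniff_bom(b"\xef\xbb\xbfTest"))  # utf8
print(sniff_bom(b"Test"))              # ascii
```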
AP.
  • `Get-FileEncoding` is not recognized in my PowerShell, though I'm on Windows 10? – user310291 Mar 01 '17 at 22:22
  • The OP created their file with GitHub's Atom editor, which creates UTF-8 files _without a pseudo-BOM_ by default, and that's the cause of the problem. Notepad does _not_ create UTF-8 files by default - it uses your system's _legacy codepage_ by default (e.g., Windows-1252 on English-language systems), and so does PowerShell when _reading_ a file without a BOM; that's why you didn't see the problem. As an aside: `cat` is just an alias for `Get-Content` on Windows, so there's no point in contrasting the two commands. – mklement0 Mar 02 '17 at 00:40
  • 1
    `Get-FileEncoding` is not a standard cmdlet. The best way to examine the file is to use standard cmdlet `Format-Hex` (PSv5+) and study the raw bytes. I found two likely `Get-FileEncoding` sources: from [here at poshcode.org](http://poshcode.org/2059) or as part of the [PowerShellCookbook module](https://www.powershellgallery.com/packages/PowerShellCookbook/1.3.6) in the PowerShell Gallery. Neither version reports UTF-8 for me (Windows 10, PSv5.1): the former only looks for a BOM and reports ASCII if there's none (which is true for `test.html`); similarly, the latter falls back to UTF-7. – mklement0 Mar 02 '17 at 04:18
  • Thanks for providing the `Get-FileEncoding` function. However, like the versions I linked to, it only looks at _BOMs_, and when it reports `ascii`, that really means "I don't know what the encoding is, because the file has no BOM" (and I'm slightly curious why a zero-byte file is `utf8`). However, it is sufficient to verify your claim that Notepad creates UTF-8 files by default: If I do what you state in your answer, using your function - having made sure that there's no preexisting file `.\test.html` and pasting text `Testé encoding utf-8`, I get `ascii`, _not `utf8`_. What do you get? – mklement0 Mar 03 '17 at 22:39
  • 1
    So I use [Notepad2](https://sourceforge.net/projects/notepad2/) and thus was able to change the file encoding to: `UTF-8 Signature`. Yes you are correct, since when I use the standard `UTF-8` w/o signature, I get `ascii` from the function as well – AP. Mar 04 '17 at 22:43