11

I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run.

$metadataPath = "\\ServerPath\foo\"  # trailing backslash so file names can be appended directly

function concatenateMetadata {
    $cFile = $metadataPath + "whiconcat.csv"
    Clear-Content $cFile
    $metadataFiles = gci $metadataPath
    $iterations = $metadataFiles.Count
    for ($i=0;$i -le $iterations-1;$i++) {
        $iFile = "whidata"+$i+".htm"
        $FileExists = (Test-Path $metadataPath$iFile -PathType Leaf)
        if (!($FileExists))
        {
            break
        }
        elseif ($FileExists)
        {
            Write-Host "Adding " $metadataPath$iFile
            Get-Content $metadataPath$iFile | Out-File $cFile -append
            Write-Host "to" $cfile
        }
    }
} 

The whidataXX.htm files are encoded as UTF-8, but my output file is encoded as UTF-16. When I view the file in Notepad, it appears correct, but when I view it in a hex editor, the value 00 appears between each character, and when I pull the file into a Java program for processing, it prints to the console with extra spaces between c h a r a c t e r s.

First, is this normal for PowerShell, or is there something in the source files that would cause this?

Second, how would I fix this encoding problem in the code noted above?

dwwilson66
  • In PowerShell 6.0, this has been rendered unnecessary: PowerShell now defaults to UTF-8 without a BOM for redirection. See https://github.com/PowerShell/PowerShell/issues/4878 – kumarharsh May 22 '18 at 11:44

2 Answers

17

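The Out-* cmdlets (like Out-File) format the data before writing it, and in Windows PowerShell the default encoding for Out-File is "Unicode", which means UTF-16 little-endian. That is where the 00 bytes between characters come from.

You can add an -Encoding parameter to Out-File: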

Get-Content $metadataPath$iFile | Out-File $cFile -Encoding UTF8 -append

Or switch to Add-Content, which doesn't re-format the data:

Get-Content $metadataPath$iFile | Add-Content $cFile 
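
Putting the -Encoding suggestion together, here is a minimal sketch of the whole concatenation with the encoding stated explicitly on both the read and the write. The function name Join-MetadataFiles and its -Path and -OutputName parameters are hypothetical, not from the original script. It also swaps the counting loop for a wildcard filter, and assumes the file numbers are zero-padded so that sorting by name gives the right order.

```powershell
# Sketch only: the function name and parameters are illustrative assumptions.
function Join-MetadataFiles {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path,                          # folder holding the whidataXX.htm files
        [string]$OutputName = 'whiconcat.csv'   # name of the combined file
    )

    $cFile = Join-Path $Path $OutputName

    # Read every matching file as UTF-8 and write one combined UTF-8 file.
    # Set-Content replaces any existing output, so no Clear-Content is needed.
    Get-ChildItem -Path $Path -Filter 'whidata*.htm' |
        Sort-Object Name |
        ForEach-Object { Get-Content -Path $_.FullName -Encoding UTF8 } |
        Set-Content -Path $cFile -Encoding UTF8

    # Emit the output path so the caller can use it.
    $cFile
}
```

It could be called as Join-MetadataFiles -Path "\\ServerPath\foo". Note that in Windows PowerShell, -Encoding UTF8 writes a byte order mark at the start of the file, which a Java reader may need to skip.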
mjolinor
  • And to confirm, Add-Content will simply append the new data to the existing file, correct? – dwwilson66 Oct 15 '13 at 18:39
  • Yes. Its counterpart, Set-Content, will overwrite the existing data. – mjolinor Oct 15 '13 at 18:41
1

First, the fact that you get two bytes per character indicates that a 16-bit encoding is being used: UTF-16, which for characters in the Basic Multilingual Plane is byte-for-byte the same as the older fixed-length UCS-2. The article at http://www.kongsli.net/nblog/2012/04/20/powershell-gotchas-redirect-to-file-encodes-in-unicode/ explains that file redirection in PowerShell causes the output to be written in this encoding, and it also provides a fix.
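
To see why a hex editor shows 00 between the characters, it can help to compare the bytes directly. The following is a small illustration using the .NET encoding classes, not the fix from the linked article; the variable names are arbitrary.

```powershell
# Compare how the same ASCII text is laid out in UTF-16LE and in UTF-8.
$text  = 'abc'
$utf16 = [System.Text.Encoding]::Unicode.GetBytes($text)   # "Unicode" is UTF-16 little-endian
$utf8  = [System.Text.Encoding]::UTF8.GetBytes($text)

# Print each byte array as space-separated hex.
($utf16 | ForEach-Object { $_.ToString('X2') }) -join ' '   # 61 00 62 00 63 00
($utf8  | ForEach-Object { $_.ToString('X2') }) -join ' '   # 61 62 63
```

Each ASCII character takes two bytes in UTF-16, with the second byte being 00. A program that reads those bytes as a single-byte encoding will show the 00 bytes as gaps between characters.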

Tarik