
I came across a little issue when dealing with CSV exports that contain mutated vowels like ä, ö, ü (German umlauts).

I simply export with:

Get-WinEvent -FilterHashtable @{Path=$_; ID=4627} -ErrorAction SilentlyContinue | Export-Csv -NoTypeInformation -Encoding Default -Force ("c:\temp\CSV_temp\" + $_.BaseName + ".csv")

which works fine: the ä, ö, ü appear correctly in my CSV file.

After that I do a little sorting with:

Get-ChildItem 'C:\temp\*.csv' |
    ForEach-Object { Import-Csv $_.FullName } |
    Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
    Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force

I played around with all the encodings (ASCII, BigEndianUnicode, the Unicode variants) with no success.

How can I preserve the special characters ä, ö, ü (and others) when exporting and sorting?

  • Make sure you supply the correct encoding to `Import-Csv` as well – Mathias R. Jessen Feb 23 '18 at 11:52
  • Awesome, i simply missed the encoding at the inbound of the foreach import object. Thank you so much, it works perfectly. – Peter Core Feb 23 '18 at 14:27
  • The fundamental rule of character encodings is to read bytes that represent text with the same encoding that it was written with. `Default` varies from machine to machine, user to user and even time to time, so it is very rarely one that a writer would want to use. – Tom Blodget Feb 25 '18 at 18:34
  • @TomBlodget: In the (legacy) _Windows_ world, the default encoding varies by _culture_ and is set at the _machine_ level (the _Windows "ANSI" code page_, ACP). While such an - unmarked - culture-dependent encoding is increasingly problematic, it is still very common, and Windows PowerShell _in part_ defaults to it - notably with the basic `Get-Content` and `Set-Content` cmdlets; despite even naming the ACP encoding `Default` when used with the `-Encoding` parameter, however, other cmdlets do _not_ use it by default, which is the truly problematic aspect here. – mklement0 Feb 25 '18 at 19:45
  • 1
    Sure it's troublesome when standard cmdlets don't have a consistent behavior. But, look at it this way: Since 1996, the Windows OS has used the Unicode character set. So, if your text source is the OS, such as via `Get-WinEvent`, and you don't use a Unicode encoding, such as UTF-8, you potentially get silent data loss. So, one strategy is to just specify UTF-8 everywhere. – Tom Blodget Feb 26 '18 at 03:32

1 Answer


Mathias R. Jessen provides the crucial pointer in a comment on the question:

It is the Import-Csv call, not Export-Csv, that is the cause of the problem in your case:

Like Export-Csv, Import-Csv also needs to be passed -Encoding Default in order to properly process text files encoded with the system's active legacy "ANSI" code page, which is an 8-bit, single-byte character encoding such as Windows-1252.
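
Applied to the merge-and-sort pipeline from the question, the fix is a one-parameter change; a sketch (assuming $pattern and $culture are defined as in the original script):

# Read the "ANSI"-encoded input files with the same encoding they were
# written with, by passing -Encoding Default to Import-Csv as well.
Get-ChildItem 'C:\temp\*.csv' |
    ForEach-Object { Import-Csv $_.FullName -Encoding Default } |
    Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
    Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force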

In Windows PowerShell, even though the generic text-file processing Get-Content / Set-Content cmdlet pair defaults to Default encoding (as the name suggests), regrettably and surprisingly, Import-Csv and Export-Csv do not.

Note that, on reading, a default encoding is only assumed if the input file has no BOM (byte-order mark, a.k.a. Unicode signature: a magic byte sequence at the start of the file that unambiguously identifies its encoding).
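
If you are unsure whether a given input file has a BOM, you can inspect its first bytes; a minimal sketch (some.csv is a placeholder name):

# EF BB BF at the start indicates a UTF-8 BOM; FF FE a UTF-16 LE BOM;
# FE FF a UTF-16 BE BOM. Any other leading bytes mean there is no BOM.
$bytes = [System.IO.File]::ReadAllBytes((Convert-Path .\some.csv))
($bytes[0..2] | ForEach-Object { '{0:X2}' -f $_ }) -join ' '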

Not only do Import-Csv and Export-Csv have defaults that differ from Get-Content / Set-Content, they individually have different defaults:

  • Import-Csv defaults to UTF-8.
  • Export-Csv defaults to ASCII(!), which means that any non-ASCII characters, such as ä, ö, ü, are transliterated to literal ? characters, resulting in data loss.

By contrast, in PowerShell Core, the cross-platform edition built on .NET Core, the default encoding is (BOM-less) UTF-8, consistently, across all cmdlets, which greatly simplifies matters and makes it much easier to determine when you do need to use the -Encoding parameter.
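
If a script must run on both editions, you can branch on the edition at run time; a small sketch:

# $PSVersionTable.PSEdition is 'Core' in PowerShell Core and 'Desktop' in
# Windows PowerShell 5.1 (in older versions the property is absent, in which
# case the comparison below is simply $false).
if ($PSVersionTable.PSEdition -eq 'Core') {
    'PowerShell Core: BOM-less UTF-8 is the consistent default.'
} else {
    'Windows PowerShell: specify -Encoding explicitly to be safe.'
}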


Demonstration of the Windows PowerShell Import-Csv / Export-Csv behavior

Import-Csv - defaults to UTF-8:

# Sample CSV content.
$str = @'
Column1
aäöü
'@

# Write sample CSV file 't.csv' using UTF-8 encoding *without a BOM*
# (Note that this cannot be done with standard PowerShell cmdlets.)
$null = New-Item -ItemType File t.csv -Force
[IO.File]::WriteAllLines((Convert-Path t.csv), $str)

# Use Import-Csv to read the file, which correctly preserves the UTF-8-encoded
# umlauts
Import-Csv .\t.csv

The above yields:

Column1
-------
aäöü

As you can see, the umlauts were correctly preserved.

By contrast, had the file been "ANSI"-encoded ($str | Set-Content t.csv, with -Encoding Default implied), the umlauts would have been corrupted.
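
To make that corruption visible, you can write an "ANSI"-encoded copy and import it with and without -Encoding Default; a sketch (on a system whose ANSI code page is Windows-1252; t.ansi.csv is an arbitrary scratch-file name):

# Set-Content defaults to the "ANSI" code page in Windows PowerShell.
$str | Set-Content t.ansi.csv
Import-Csv .\t.ansi.csv                    # umlauts misread (replacement chars.)
Import-Csv .\t.ansi.csv -Encoding Default  # umlauts preserved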


Export-Csv - defaults to ASCII - risk of data loss:

Building on the above example:

Import-Csv .\t.csv | Export-Csv .\t.new.csv
Get-Content .\t.new.csv

yields:

"Column1"
"a???"

As you can see, the umlauts were replaced by literal question marks (?).
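
The fix, again, is to specify a suitable encoding explicitly; a sketch using UTF-8:

# An explicit -Encoding prevents the transliteration to '?'.
# In Windows PowerShell, -Encoding UTF8 writes a UTF-8 BOM, which both
# Import-Csv and Get-Content subsequently honor.
Import-Csv .\t.csv | Export-Csv .\t.new.csv -Encoding UTF8
Get-Content .\t.new.csv   # -> "Column1" / "aäöü"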

  • 1
    *Export-Csv defaults to ASCII(!)* - insane, but true! Thanks for this detailed answer. – Andre Feb 07 '22 at 08:54
  • Yeah, @Andre, it's quite unfortunate; glad to hear the answer helped. Thankfully, _PowerShell (Core) 7+_ now uses UTF-8, _consistently_ across all cmdlets. – mklement0 Feb 07 '22 at 13:32