I am converting Strings with weird symbols I don't want into Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this is only about the .NET Methods:

    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
    $String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)

This works pretty well, except that the weird symbols don't get removed; question marks are created in their place. For example:

"Helloäöü?→"

becomes

"Helloäöü?????"

What I want is to only convert valid bytes, without creating question marks, so the output will be:

"Helloäöü?"

Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...

MySurmise

1 Answer

One option is to use a regex-based -replace operation based on named Unicode blocks:

"Helloäöü€?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}–—€‚‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•˜™š›œžŸ]'

Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes:

  • \p{IsBasicLatin} and \p{IsLatin-1Supplement} match characters that fall into the ISO-8859-1 Unicode subrange (U+0000 through U+00FF), which is mostly the same as Windows-1252 but lacks a few of its characters.

  • The explicitly enumerated characters (€...) are those Windows-1252 characters not present in ISO-8859-1 (which therefore have different code points in Unicode than in Windows-1252, namely outside the 8-bit range).

    • – and — (en dash and em dash) are placed first, so that they aren't mistaken for describing a range of characters (the .NET regex engine apparently allows their interchangeable use with -, the regular ASCII-range hyphen).
    • ‚ (single low-9 quotation mark), like any typographic single quote, is doubled in order to escape it, because PowerShell allows its interchangeable use with ' (single quote); see also this answer, which summarizes all such interchangeable uses allowed in PowerShell.

By replacing all non-matching (^) characters with the (implied) empty string, all non-Windows-1252 characters are effectively removed.
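If you would rather not type the extra Windows-1252 characters by hand, the character class can also be derived from the encoding itself. The following is only a sketch under a few assumptions: that code page 1252 is available in your session, and that .NET decodes the five undefined Windows-1252 bytes to code points below U+0100 (if it does not, the filter may need adjusting). The variable names are illustrative. Because the pattern is built from \uXXXX escapes, the script contains no literal non-ASCII characters and no quote or dash characters that would need special handling.

```powershell
$enc = [System.Text.Encoding]::GetEncoding(1252)

# Bytes 0x80..0x9F are the only range where Windows-1252 departs from ISO-8859-1.
$decoded = $enc.GetString([byte[]] (0x80..0x9F))

# Keep only the characters that decoded to code points outside the 8-bit range
# (assumption: undefined bytes decode to code points at or below U+00FF).
$extra = $decoded.ToCharArray() | Where-Object { [int] $_ -gt 0xFF }

# Express each character as a \uXXXX regex escape and build the negated class.
$class   = -join ($extra | ForEach-Object { '\u{0:X4}' -f [int] $_ })
$pattern = '[^\u0000-\u00FF' + $class + ']'

# Sample input: "Hello", a-umlaut (U+00E4), euro sign (U+20AC), right arrow (U+2192).
$sample = 'Hello' + [char] 0xE4 + [char] 0x20AC + [char] 0x2192

# The arrow is not a Windows-1252 character, so it is the only one removed.
$clean = $sample -creplace $pattern
```

This trades the readability of the literal list for independence from the script file's own character encoding.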

A general caveat:

  • Due to the use of literal non-ASCII-range characters in the command, be sure that PowerShell interprets your script file's character encoding correctly, which notably means using UTF-8 files with BOM for the benefit of Windows PowerShell - see this answer.

However, your to-and-from-bytes encoding approach can be used with a slight adaptation, which works with any target encoding (without needing to enumerate individual characters, such as above):

Using a System.Text.EncoderReplacementFallback instance initialized with the empty string effectively removes all characters that cannot be represented in the target encoding.

$string = "Helloäöü€?→"

$encoding = [System.Text.Encoding]::GetEncoding(
  1252,
  # Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
  [System.Text.EncoderReplacementFallback]::new(''),
  [System.Text.DecoderFallback]::ExceptionFallback # not relevant here
)

$string = $encoding.GetString($encoding.GetBytes($string))
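To illustrate that the fallback approach is not tied to Windows-1252, here is a minimal sketch that wraps it in a function taking the code page as a parameter. The function name Remove-UnencodableChar and its parameter names are hypothetical, chosen only for this example. Code pages 20127 (US-ASCII) and 28591 (ISO-8859-1) serve as alternative targets.

```powershell
# Hypothetical helper: removes every character that the given code page cannot encode.
function Remove-UnencodableChar {
  param(
    [string] $Text,
    [int] $CodePage = 1252
  )
  $enc = [System.Text.Encoding]::GetEncoding(
    $CodePage,
    # An empty replacement string means unencodable characters are dropped.
    [System.Text.EncoderReplacementFallback]::new(''),
    [System.Text.DecoderFallback]::ExceptionFallback
  )
  # Round trip: string -> bytes (lossy) -> string.
  $enc.GetString($enc.GetBytes($Text))
}

# Sample input: "Hello", a-umlaut (U+00E4), euro sign (U+20AC), right arrow (U+2192).
$sample = 'Hello' + [char] 0xE4 + [char] 0x20AC + [char] 0x2192

$asWin1252 = Remove-UnencodableChar -Text $sample                  # keeps a-umlaut and euro sign
$asLatin1  = Remove-UnencodableChar -Text $sample -CodePage 28591  # keeps a-umlaut only
$asAscii   = Remove-UnencodableChar -Text $sample -CodePage 20127  # keeps only "Hello"
```

Creating the encoding inside the function keeps the sketch short; if you call it in a tight loop, caching the encoding object per code page would avoid repeated construction.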
mklement0
  • This is a much more elegant solution. Do you have a specific online resource to recommend that shows the differences between all those encodings? Also, thanks for upvoting, I can finally comment now :D – MySurmise Dec 13 '22 at 03:25
  • @MySurmise, re up-voting privilege :) Please see my update re covering _all_ Windows-1252 characters. As for resources: Wikipedia is a good source; I've added a link to Wikipedia's Windows-1252 article to the answer. – mklement0 Dec 13 '22 at 03:55