9

I have a script that creates users in Microsoft Exchange Server and Active Directory. Although it's common for users' names in Spain to contain accents or ñ, I want to avoid them in the username so that they don't cause incompatibilities in older systems.

So, how could I clean a string like this?

$name = "Ramón"

So that it ends up like this:

$name = "Ramon"
Antonio Laguna

7 Answers

22

As per ip.'s answer, here is the PowerShell version.

function Remove-Diacritics {
param ([String]$src = [String]::Empty)
  $normalized = $src.Normalize( [Text.NormalizationForm]::FormD )
  $sb = new-object Text.StringBuilder
  $normalized.ToCharArray() | % { 
    if( [Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
      [void]$sb.Append($_)
    }
  }
  $sb.ToString()
}

# Test data
@("Rhône", "Basíl", "Åbo", "", "Gräsäntörmä") | % { Remove-Diacritics $_ }

Output:

Rhone
Basil
Abo

Grasantorma
vonPryz
8

Well, I can help you with some of the code.

I used this recently in a C# project to strip diacritics from email addresses:

    static string RemoveDiacritics(string input)
    {
        string inputFormD = (input ?? string.Empty).Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (var i = 0; i < inputFormD.Length; i++)
        {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(inputFormD[i]);
            if (uc != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(inputFormD[i]);
            }
        }

        return (sb.ToString().Normalize(NormalizationForm.FormC));
    }

I guess I can now say 'extending into a PowerShell script/form is left to the reader'.... hope it helps....

penderi
7

With the help of the examples above, I use this one-liner in a pipeline (tested only on Windows 10):

"öüóőúéáűí".Normalize("FormD") -replace '\p{M}', ''

Result:

ouooueaui
it_specialist
7

Another PowerShell translation of @ip for non C# coders ;o)

function Remove-Diacritics 
{
  # [String[]] so that the parameter may be a string or a list of strings
  param ([String[]]$sToModify = @())

  foreach ($s in $sToModify)
  {
    if ($null -eq $s) { [String]::Empty; continue }

    # Decompose each accented letter into base letter + combining mark
    $sNormalized = $s.Normalize("FormD")
    $res = ""   # reset the result for every input string

    foreach ($c in [Char[]]$sNormalized)
    {
      $uCategory = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($c)
      if ($uCategory -ne "NonSpacingMark") {$res += $c}
    }

    $res   # emit one cleaned string per input string
  }
}

Clear-Host
$name = "Un été de Raphaël"
Write-Host (Remove-Diacritics $name )
$test = ("äâûê", "éèà", "ùçä")
$test | % {Remove-Diacritics $_}
Remove-Diacritics $test
JPBlanc
4
PS> [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding(1251).GetBytes("Ramón"))
Ramon
PS>
Damian Powell
  • Fails for some characters, e.g. `Æ×Þ°±ß…`. [A real _Old English_ example](https://www.researchgate.net/publication/277748378_Fore_daere_maerde_mod_astige_two_new_perspectives_on_the_Old_English_Gifts_of_men): returns `Fore ??re m?r?e?` if applied to `Fore ðære mærðe…` – JosefZ Mar 20 '16 at 16:03
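The letters named in that comment (Æ, ð, Þ, ß and similar) have no canonical Unicode decomposition, so the Normalize-based answers in this thread leave them unchanged as well. Here is a minimal sketch of one possible workaround, assuming a hand-written replacement table. The function name `Convert-ToAsciiName` and the table contents are hypothetical, and the transliterations are only one convention among several:

```powershell
# Hypothetical helper: strips combining marks, then applies a small manual table
# for letters that have no Unicode decomposition. The table is illustrative, not complete.
function Convert-ToAsciiName {
  param ([String]$Text = [String]::Empty)

  # Step 1: decompose and drop combining marks (handles ó, ñ, ä, ...)
  $result = $Text.Normalize([Text.NormalizationForm]::FormD) -replace '\p{M}', ''

  # Step 2: explicit replacements. An array of pairs is used instead of a hashtable
  # because PowerShell hashtable keys are case-insensitive ('Æ' and 'æ' would collide).
  $pairs = @(
    @('Æ', 'AE'), @('æ', 'ae'),
    @('Ø', 'O'),  @('ø', 'o'),
    @('Ð', 'D'),  @('ð', 'd'),
    @('Þ', 'Th'), @('þ', 'th'),
    @('ß', 'ss')
  )
  foreach ($pair in $pairs) {
    # String.Replace(string, string) is an ordinal, case-sensitive replacement
    $result = $result.Replace($pair[0], $pair[1])
  }

  $result
}

Convert-ToAsciiName "Fore ðære mærðe"   # Fore daere maerde
Convert-ToAsciiName "Ramón Núñez"       # Ramon Nunez
```

Anything not covered by the table passes through unchanged, so a final check for remaining non-ASCII characters may still be worthwhile before creating the account.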
3

Instead of creating a StringBuilder and looping over the characters, you can just use -replace on the NFD string to remove the combining marks:

function Remove-Diacritics {
param ([String]$src = [String]::Empty)
  $normalized = $src.Normalize( [Text.NormalizationForm]::FormD )
  ($normalized -replace '\p{M}', '')
}
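One difference from the C# versions in this thread is that they normalise the result back to FormC before returning it. For Latin names this rarely matters, but FormD also splits some characters that contain no combining marks at all (Hangul syllables, for instance), and those stay split unless the string is recomposed. A sketch of the same idea in PowerShell, where the function name `Remove-DiacriticsNfc` is made up for illustration:

```powershell
# Hypothetical variant: same -replace approach, plus recomposition to FormC at the end
function Remove-DiacriticsNfc {
  param ([String]$src = [String]::Empty)

  # Decompose, then drop every character in the Unicode "Mark" categories
  $stripped = $src.Normalize([Text.NormalizationForm]::FormD) -replace '\p{M}', ''

  # Recompose whatever is left (e.g. Hangul jamo back into syllables)
  $stripped.Normalize([Text.NormalizationForm]::FormC)
}

Remove-DiacriticsNfc "Ramón"             # Ramon
(Remove-DiacriticsNfc "한글").Length      # 2 (without the FormC step the jamo stay separate)
```

Note that `\p{M}` matches all mark categories, whereas the loop-based answers remove only `NonSpacingMark`. For accented Latin letters the two behave the same.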
Peter Constable
2

Another solution... quickly "reuse" your C# in PowerShell (C# code credits lost somewhere on the net).

Add-Type -TypeDefinition @"
    using System.Text;
    using System.Globalization;

    public class Utils
    {
        public static string RemoveDiacritics(string stIn)
        {
            string stFormD = stIn.Normalize(NormalizationForm.FormD);
            StringBuilder sb = new StringBuilder();

            for (int ich = 0; ich < stFormD.Length; ich++)
            {
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
                if (uc != UnicodeCategory.NonSpacingMark)
                {
                    sb.Append(stFormD[ich]);
                }
            }
            return (sb.ToString().Normalize(NormalizationForm.FormC));
        }
    }
"@ | Out-Null

[Utils]::RemoveDiacritics("ABC-abc-ČŠŽ-čšž")
blank3