
I have the following scenario: I have a huge text file that is full of words containing the REPLACEMENT CHARACTER "�". My script has already produced a dictionary that provides the correct translation of these words as key-value pairs. It looks like this:

"gew�hlte":  "gewählte"
"Betr�ge;":  "Beträge;"

I have about 1200 entries in this dictionary. On the (huge) text file I'm using this command in a loop to do my corrections:

foreach($key in $solutionsDictionary.Keys)
{
    #Replace the key with value.
    [String]$value = $solutionsDictionary[$key]
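    # Note: this reads and rewrites the entire file once for every dictionary key (about 1200 times).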
    (Get-Content -encoding UTF8 $file) -replace [Regex]::Escape($key), "$value" | Set-Content -encoding UTF8 $file
}

But it works incredibly slowly. To speed it up, I would like to filter for the lines that actually contain this character and then correct those lines specifically, using the broken words as keys into my lexicon instead of trying each key until I have found the right one. However, I do not know how I can write back a single line into the file within the iteration and then continue looking for the next one. The new, incomplete algorithm looks like this:

$SearchCharacter = '�'
$lines = get-content $file -encoding UTF8 | select-string $SearchCharacter
foreach ($line in $lines)
{
    # Split into words and find the ones which contain the searchCharacter
    $words = -split $line
    $words = @($words) -match $SearchCharacter

    foreach ($word in $words){
        # Replace each word in the line, using the word as the dictionary key.

        # Code missing here. How to write back a single line?
    }
}

If the "select-string" Property is the Problem, i can do the replacement without it. Any suggestion on how to do this? Thanks a lot!


Edit: The following solution came up:

$SearchCharacter = '�'
Get-Content $file -encoding UTF8 |
ForEach-Object {
    If ($_.Contains($SearchCharacter)) {
        $Words = $_ -Split '\s+'
        $words = @($words) -match $SearchCharacter
        ForEach ($Word in $Words) {
            If ($solutionsDictionary.ContainsKey($Word))
            {
                $_.Replace($Word, $solutionsDictionary[$Word])
            }
        }
    }
    $_
} | Set-Content -encoding UTF8 $Outfile

It works so far, but it has another disadvantage: the target file receives an extra line for every corrected word, and I just don't see how to prevent this. So for example with this input:

Das hier r�ckg�ngig ist das zu machen
r�ckg�ngig : ist bereits geamcht
Weitere W�rter gibt ers zu korrigieren
Hier noch ein bl�des Wort
zwei in einer Zeile G�hte und Gr��e

I get this output:

Das hier rückgängig ist das zu machen
Das hier r�ckg�ngig ist das zu machen
rückgängig : ist bereits geamcht
r�ckg�ngig : ist bereits geamcht
Weitere Wörter gibt ers zu korrigieren
Weitere W�rter gibt ers zu korrigieren
Hier noch ein blödes Wort
Hier noch ein bl�des Wort
zwei in einer Zeile Göhte und Gr��e
zwei in einer Zeile G�hte und Größe
zwei in einer Zeile G�hte und Gr��e

So how can I prevent PowerShell from writing a new line for every correction?


Edit 2:

The right solution for that is to assign the result of the replacement back to $_. Without the assignment, the bare $_.Replace(...) call sends its return value down the pipeline as an extra output line, in addition to the unchanged $_ that is emitted at the end:

$SearchCharacter = '�'
Get-Content $file -encoding UTF8 |
ForEach-Object {
    If ($_.Contains($SearchCharacter)) {
        $Words = $_ -Split '\s+'
        $words = @($words) -match $SearchCharacter
        ForEach ($Word in $Words) {
            If ($solutionsDictionary.ContainsKey($Word))
            {
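                # Assign the result back to $_ instead of emitting it to the pipeline.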
                $_ = $_.Replace($Word, $solutionsDictionary[$Word])
            }
        }
    }
    $_
} | Set-Content -encoding UTF8 $Outfile
– ImperatorMing

1 Answer


I would use your second idea together with the PowerShell pipeline (processing each line) and a hash table to check for the special words:

$SearchCharacter = '�'
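# Build the '\uXXXX' escape sequence for the search character, for use in the regex pattern below.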
$ux4 = '\u{0:X4}' -f [bitconverter]::ToInt16([System.Text.Encoding]::Unicode.GetBytes($SearchCharacter))

$HashTable = ConvertFrom-StringData -Delimiter ':' '
gew�hlte: gewählte
Betr�ge: Beträge
'

Get-Content .\InFile.txt -encoding UTF8 |
ForEach-Object {
    If ($_ -Match "[\w$ux4]*$ux4+[\w$ux4]*") {
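        # The -Match operator finds only the first matching word on the line; $Matches.Values holds that single match.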
        ForEach ($Word in $Matches.Values) {
            If ($HashTable.ContainsKey($Word)) { $_ = $_.Replace($Word, $HashTable[$Word]) }
        }
    }
    $_
} | Set-Content -encoding UTF8 .\OutFile.txt
– iRon
  • Seems to be the answer I was looking for. I will try it out and update this comment when I'm done. Thanks! – ImperatorMing May 07 '20 at 18:26
  • I have tried your solution. First of all, there was a misunderstanding: I already have a dictionary initialized like $solutionsDictionary = @{}. It is mostly the same, isn't it? Or do I get an advantage by converting to a hash table? I don't think that's important, I just used my dictionary there. Or maybe I'm just mislabeling this data type. Next, I inserted this line to make sure the words array contains only words with special characters: $words = @($words) -match $SearchCharacter. This is all a minor matter; I described the real problem in the edit of my post. Can you help? – ImperatorMing May 08 '20 at 06:45
  • 1
    Okay i already figured it out. The assignment is missing. The line should be: $_ = $_.Replace($Word, $HashTable[$Word]) – ImperatorMing May 08 '20 at 07:09
  • It's all right. Thanks for your help, great job! I'll mark this as the best answer. – ImperatorMing May 08 '20 at 09:09
  • It seems I cannot add the following to your post; maybe you can: $_ = and $words = @($words) -match $SearchCharacter. – ImperatorMing May 08 '20 at 09:11
  • I have given it some more thought, as I presume that the `-Split '\s+'` might cause issues because it leaves punctuation attached to the words (e.g. in `der Punkt nach dem Betr�ge.`). Therefore you probably want to search for word characters (`\w`), and if you are doing that anyway, it is probably faster to directly select just the words with the special character (but that probably depends on how often it appears in your text). – iRon May 08 '20 at 17:03
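
Following up on that last comment, a minimal sketch of the word-character variant (same hypothetical file names and $HashTable as in the answer above). [regex]::Matches returns every word on a line that contains the replacement character, so trailing punctuation stays outside the match and lines with several broken words are all corrected; this assumes the lookup keys are bare words without punctuation, as in $HashTable above.

$SearchCharacter = [char]0xFFFD   # U+FFFD, the REPLACEMENT CHARACTER "�"
$WordPattern = '[\w{0}]*{0}[\w{0}]*' -f $SearchCharacter

Get-Content .\InFile.txt -Encoding UTF8 |
ForEach-Object {
    $Line = $_
    # Find every word on the line that contains the search character.
    ForEach ($Match in [regex]::Matches($Line, $WordPattern)) {
        If ($HashTable.ContainsKey($Match.Value)) {
            $Line = $Line.Replace($Match.Value, $HashTable[$Match.Value])
        }
    }
    $Line
} | Set-Content -encoding UTF8 .\OutFile.txt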