2

I am using a PowerShell script and Regex to turn giant (>1GB) fixed field length text files into importable tab delimited files. The code is very fast. I need to change some of the captured fields (let's say the 4th, 6th, and 7th fields) to 0 if they are empty after trimming. Is there a super-fast way to do this, say as part of the regex capture without slowing this process down much?

DATA

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 


PROGRAM

$proc_yyyymm = '201912'
$match_data_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

while ($line = $stream_in.ReadLine()) {

   if ($line -match $match_data_regex) {
      $new_line = "$proc_yyyymm`t" + ($Matches[1..($Matches.Count-1)].Trim() -join "`t")
      $stream_out.WriteLine($new_line)
   }
}
Mark
  • 193
  • 11
  • I was thinking of pre-defining an array of field positions (e.g. - @(3,5,6)) that need to be modified from empty to 0 to use at run time depending on the format of the file being processed. – Mark Dec 11 '19 at 19:19
  • Yes, you would add code to test the trimmed value at those indices and substitute another value as appropriate. You could also use the [`[Regex]::Replace()` method](https://learn.microsoft.com/dotnet/api/system.text.regularexpressions.regex.replace) to do the same thing. By the way, just by posting this question it goes without saying that it's addressed to "someone who does" "maybe ... know this one". – Lance U. Matthews Dec 11 '19 at 19:32
  • Thanks @BACON I understand the test and assign option. I'm not sure how to use replace on specified capture groups or fields in my code. Do you have an example of that for my case? – Mark Dec 11 '19 at 19:41
  • FYI - in my actual implementation, I also stream an error file and log file. – Mark Dec 11 '19 at 21:20
  • Performance is critical in these loops as I process many files and each one is a monster. – Mark Dec 11 '19 at 21:22
  • I was going to ask if performance is critical why you're using regex instead of extracting the fields as fixed-length substrings from each line? Is it just because of the brevity of regex? Or because, evidently, you've found regex to be fast enough? – Lance U. Matthews Dec 11 '19 at 21:25
  • I don't know your field sized block trick :) So please share if you like or I'll google what I can find. Elegance and brevity also matters but the main thing is that I'm currently preparing 5 files for load in about 4 hours (unzip from source directory, parse and reformat for MS SQL Server upload, copy to staging). Faster would be better, but 10% longer for functionality or simplicity would not be catastrophic. The wrong tweak could easily push it into greater than a day (something not desirable). – Mark Dec 11 '19 at 21:34
  • @BACON, this process was handled manually by the previous 'guy' taking 2 to 3 days including the actual uploads. – Mark Dec 11 '19 at 21:36
  • I have added a couple non-regex solutions to my answer. I am _assuming_ that parsing in code is typically faster than regex, but perhaps regex can be fast(er) when you explicitly tell it the length of each capture like you have here. Also, in past benchmarks I've found that even when it's just connecting framework calls PowerShell code can be quite slow compared to the equivalent C#, so perhaps it's actually desirable to have execution spend its time within PowerShell/framework calls rather than the code of your script. – Lance U. Matthews Dec 11 '19 at 23:17
  • 1
    I have a solution there that you can test `if(! $value) { $value = 0 }` https://stackoverflow.com/a/57647495/6654942 – js2010 Dec 12 '19 at 15:54
  • I removed "PowerShell" from the title, among other edits, since it's lengthy and generally the language/environment is specified by and required to be in the tags, but not necessary in the title. Rolling back to an earlier revision (particularly a revision other than previous one) [undoes _all_ the intervening edits](https://stackoverflow.com/posts/59292588/revisions#spacer-dd7cb848-d497-4660-b3d2-3cde58576ab3). – Lance U. Matthews Dec 12 '19 at 20:55
  • I edited a previous 'no explanation for the edit' of the title to 'update the explanation only' and the title also automatically reverted/changed. I wasn't expecting that. :| – Mark Dec 12 '19 at 21:09

5 Answers

3

Taking the work that @BACON and @TheMadTechnician provided, I thought I would evaluate the different methods of parsing the file and their relative performance.

Taking the example text:

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   

Copy/paste it to make a 10,000-line file (a small, hypothetical generator sketch follows). With it I can evaluate, on a small scale, the relative performance of our various implementations.
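For reference, a minimal generator sketch; the paths are assumptions, and it simply repeats the two data rows 5,000 times after the header:

# Hypothetical test-file generator (paths assumed); sample.txt holds the 3 lines above
$sample = Get-Content 'C:\Temp\perf\sample.txt'
Set-Content -Path 'C:\Temp\perf\input.txt' -Value (@($sample[0]) + ($sample[1..2] * 5000))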


Regular Expressions

@BACON's First example, a straight up Regex + empty string check:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'

$delimiter = ','
$indicesToNormalizeToZero = ,2
while ($line = $stream_in.ReadLine()) {
    if ($line -match $match_regex) {
        $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
        foreach ($index in $indicesToNormalizeToZero)
        {
            if ($trimmedMatches[$index] -eq '')
            {
                $trimmedMatches[$index] = '0'
            }
        }

        $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
        $stream_out.WriteLine($new_line)
    }
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 335

Overall - pretty good performance. We'll use this as our baseline for comparison.

Next up: let's try the sweet MatchEvaluator solution (kudos to @BACON for thinking this one up):

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$match_evaluator = {
    param($match)

    # The first element of Groups contains the entire matched text; skip it
    $fields = $match.Groups `
        | Select-Object -Skip 1 `
        | ForEach-Object -Process {
            $field = $_.Value.Trim()
            if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
            {
                $field = '0'
            }
            return $field
        }
    return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}

$delimiter = ','
# Replace with a HashSet/Hashtable for better lookup performance
$groupsToNormalizeToZero = ,'3'

while ($line = $stream_in.ReadLine()) {
    $new_line = $match_regex.Replace($line, $match_evaluator)

    # The original input string is returned if there was no match
    if (-not [Object]::ReferenceEquals($line, $new_line)) {
        $stream_out.WriteLine($new_line)
    }
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 1921

Match groups, cool! Yay... ugh... but for pure performance, maybe not quite there. Likely because we would need to optimize the match evaluator more.

For completeness, let's try @TheMadTechnician's solution:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'

while ($line = $stream_in.ReadLine()) {
    #Pad right to account for truncated lines and get captured values
    $LineMatch=$match_regex.Matches($line.PadRight(160)).groups.value
    #Replace first item in array with line header (first item is entire input line, which we don't want)
    $LineMatch[0] = $proc_yyyymm
    #Replace null columns on 4, 6, and 7 with 0
    switch($true){
        {$LineMatch[3] -notmatch '\S'} {$LineMatch[3]='0'}
    }

    $stream_out.WriteLine(($LineMatch.trim() -join "`t"))

}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 545

Not quite as good as we were expecting. I suspect it's because -notmatch adds another regex evaluation for each normalized column.
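For instance (an untested assumption on my part), swapping that regex test for a plain whitespace check might claw some of that back:

# Untested: a plain whitespace test instead of a second regex evaluation per column
if ([String]::IsNullOrWhiteSpace($LineMatch[3])) { $LineMatch[3] = '0' }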


Pure PowerShell Way

Also, let's try a modified @js2010 method. Its main advantage is that it is a "pure" PowerShell way, using only PowerShell cmdlets.

$Stopwatch = [system.diagnostics.stopwatch]::startNew()

$proc_yyyymm = '201912'
$delimiter = ','
$cols = 0,11,35,54,71,80 # fake extra column at the end, assumes all rows are that wide, padded with spaces
$colsfile = "C:\Temp\perf\input.txt"
$csvfile = "C:\Temp\perf\out.CSV"

$firstline = get-content $colsfile | select -first 1
$headers = for ($i = 0; $i -lt $cols.count - 1; $i++) {
  $firstline.substring($cols[$i], $cols[$i+1]-$cols[$i]).trim()
}

Get-Content $colsfile | select -skip 1 | ForEach {
  $hash = [ordered]@{}
  $hash += @{ "proc_yyymm" = $proc_yyyymm }
  for ($i = 0; $i -lt $headers.length; $i++) {
    $value = $_.substring($cols[$i], $cols[$i+1]-$cols[$i]).trim()

    if(! $value) { $value = 0 }

    $hash += @{$headers[$i] = $value}
  }
  [pscustomobject]$hash
} | export-csv $csvfile -NoTypeInformation -Delimiter $delimiter

$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 1043

This didn't perform badly; it's actually pretty good for a "native" solution. It has the benefit of being flexible (via hash tables), and @js2010 does something that I love: by utilizing Export-Csv it outputs correctly quoted CSV (or tab-delimited) files. The problem with the other solutions above is that the strings aren't wrapped in quotes to escape commas, or other quotes, in the text. Export-Csv will do that.
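To illustrate the quoting (the values here are made up):

[pscustomobject]@{ ID = '10000000003'; LAST_NAME = 'MOUSE, JR.' } |
    ConvertTo-Csv -NoTypeInformation -Delimiter ','
# "ID","LAST_NAME"
# "10000000003","MOUSE, JR."

The embedded comma survives because the whole field is quoted.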


Substring Executions

Since we know we are dealing with fixed length columns, we can use Substrings instead of Regular Expressions. Let's see if there is any performance improvement.

@BACON's next example with a basic substring execution:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

while ($line = $stream_in.ReadLine()) {
    $id =                $line.Substring( 0, 11).Trim()
    $firstName =         $line.Substring(11, 24).Trim()
    $lastName =          $line.Substring(35, 19).Trim()
    $columnNameTooLong = $line.Substring(54, 17).Trim()
    $fifthColumn =       $line.Substring(71,  9).Trim()

    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 177

Wow, it looks like substrings run twice as fast as our baseline regular expression. This definitely looks like the path to pursue.

Let's try @BACON's next example with an ExtractField function and multiple write statements:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id                = ExtractField $lineChars  0 11
    $firstName         = ExtractField $lineChars 11 24
    $lastName          = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn       = ExtractField $lineChars 71  9

    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $stream_out.Write($proc_yyyymm)
    $stream_out.Write($delimiter)
    $stream_out.Write($id)
    $stream_out.Write($delimiter)
    $stream_out.Write($firstName)
    $stream_out.Write($delimiter)
    $stream_out.Write($lastName)
    $stream_out.Write($delimiter)
    $stream_out.Write($columnNameTooLong)
    $stream_out.Write($delimiter)
    $stream_out.WriteLine($fifthColumn)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 5675

Wow. That runs about 17 times slower. Is it the function calls or the multiple file-write calls that are causing the slowness? What if I concatenate the fields into a single write?

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id                = ExtractField $lineChars  0 11
    $firstName         = ExtractField $lineChars 11 24
    $lastName          = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn       = ExtractField $lineChars 71  9

    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 5654

The slowness is definitely caused by the additional function calls.
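A rough way to see that overhead in isolation (a hypothetical microbenchmark; Get-Echo is a made-up no-op function):

# Compare the same no-op work with and without a function call in the loop
function Get-Echo { param($x) $x }
(Measure-Command { for ($i = 0; $i -lt 100000; $i++) { $null = Get-Echo $i } }).TotalMilliseconds
(Measure-Command { for ($i = 0; $i -lt 100000; $i++) { $null = $i } }).TotalMilliseconds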

EDIT:

Let's check @BACON's latest, extremely flexible and expandable method with no function calls. Most of the time I would be willing to sacrifice some performance for flexibility:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

# This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
# The Offset property could be omitted in favor of calculating it in the loop
# based on the Length, however this way A) avoids the extra variable/addition,
# B) allows fields to be ignored if desired, and C) allows fields to be output
# in a different order than the input.
$fieldDescriptors = @(
    @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $stream_out.Write($proc_yyyymm)

    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        # If space is the only possible whitespace character,
        # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
        $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }

        $stream_out.Write($delimiter)
        $stream_out.Write($fieldText)
    }

    $stream_out.WriteLine()
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 2645

Wow. Removing the explicit function call saved half the time. It's still slower than the baseline, but is that due to creating multiple strings or doing multiple Write() operations?

Instead of multiple Write() operations, let's accumulate the line in a variable. It should be faster because it's all in memory, right?

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

$fieldDescriptors = @(
    @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $new_line = $proc_yyyymm

    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        $new_line += $delimiter
        $new_line += if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }
    }

    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 2605

Well, that was unexpected. I think in this case we are seeing the HDD buffer the write operations and mask what's going on. Larger data sets may change this, but it somewhat confirms our suspicion that the read/write operations have less of an impact than how we handle the many copies of strings in memory.
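If write buffering is a concern, one knob worth trying (an assumption on my part, not something measured here) is giving the StreamWriter a bigger buffer up front; the constructor overload is standard .NET, but the encoding and 1MB size are arbitrary:

# Untested: larger write buffer (path, encoding, and size are assumptions)
$stream_out = [System.IO.StreamWriter]::new('C:\Temp\perf\out.txt', $false, [System.Text.Encoding]::UTF8, 1MB)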


So we want to: 1) avoid function calls (they seem to generate a lot more overhead), and 2) use substrings. Are there any other ways to do this?

I suspect that a lot of the overhead comes from allocating and reassigning memory. Basically, every time you manipulate a string, e.g. with .Substring() or .Trim(), or assign it to a variable, you make a copy of the string in memory. What if we made things ugly and mashed everything into one string?

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

while ($line = $stream_in.ReadLine()) {
    $lastName = $line.Substring(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $new_line = $proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71,  9).Trim() -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 146

Definitely an improvement. Let's try one more thing:


StringBuilder

One other class we can investigate is the StringBuilder class. Let's take the last example and use a StringBuilder:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$sb = [System.Text.StringBuilder]::New(80)

while ($sb.Append($stream_in.ReadLine()).Length) {
    $lastName = $sb.ToString(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }
    $new_line = $proc_yyyymm,$sb.ToString(0, 11).Trim(),$sb.ToString(11, 24).Trim(),$lastName,$sb.ToString(54, 17).Trim(),$sb.ToString(71, 9).Trim() -join $delimiter
    $stream_out.WriteLine($new_line)
    [void]$sb.clear()
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 161

Overall, no real change in performance. The reason is that in this situation we can't take advantage of the benefits StringBuilder provides. Because we need to Trim() the data, which is something StringBuilder can't do natively, we end up converting each field back into a string to perform the trim, which negates any performance improvement.
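If we ever revisit the Char[]/ReadBlock variants, one way around the Trim() cost would be to walk the field's characters to find its bounds and build the string once. This is only a sketch, assuming space-padded fields; $lineChars, $offset, and $length stand in for the descriptor values used earlier:

# Sketch: find the trimmed bounds of a field within the Char[] buffer
$start = $offset
$end   = $offset + $length - 1
while ($start -le $end -and [Char]::IsWhiteSpace($lineChars[$start])) { $start++ }
while ($end -ge $start -and [Char]::IsWhiteSpace($lineChars[$end]))   { $end-- }
# Empty field -> '0'; otherwise build the trimmed string in a single allocation
$fieldText = if ($start -gt $end) { '0' } else { [String]::new($lineChars, $start, $end - $start + 1) }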


Super Optimize Performance

Now let's get uglier. There is still one more variable assignment we can optimize out. Let's try with the StringBuilder again:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$sb = [System.Text.StringBuilder]::New(80)

while ($sb.Append($stream_in.ReadLine()).Length) {
    $lastName = $sb.ToString(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }
    $stream_out.WriteLine(($proc_yyyymm,$sb.ToString(0, 11).Trim(),$sb.ToString(11, 24).Trim(),$lastName,$sb.ToString(54, 17).Trim(),$sb.ToString(71, 9).Trim() -join $delimiter))
    [void]$sb.clear()
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 72

Wow! It seems that even the final variable assignment caused a lot of additional overhead. Likely due to it allocating several more memory locations for the string.

And finally, let's do a pure String.Substring implementation, with the same optimization:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

while ($line = $stream_in.ReadLine()) {
    $lastName =          $line.Substring(35, 19).Trim()

    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $stream_out.WriteLine(($proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71,  9).Trim() -join $delimiter))
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 57

This is the current best option: String.Substring() with as little variable assignment as possible, and it runs almost 6 times faster than the baseline regex. Now, from @Mark's comments, if we could remove the Trim() requirement, we could try StringBuilder again... Also, we could see improvements if we could read the whole file into memory (e.g. StringBuilder the whole file); that would be interesting...
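For what it's worth, a whole-file-in-memory variant (assuming the machine has the RAM to spare, which is a big assumption for >1GB inputs) could look like this sketch; it just swaps the StreamReader for [System.IO.File]::ReadAllLines() and keeps the same setup variables and substring logic:

# Hypothetical: read the entire file up front, then reuse the fastest substring loop
$allLines = [System.IO.File]::ReadAllLines('C:\Temp\perf\input.txt')
foreach ($line in $allLines) {
    $lastName = $line.Substring(35, 19).Trim()
    if ($lastName -eq '') { $lastName = '0' }
    $stream_out.WriteLine(($proc_yyyymm,$line.Substring(0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71, 9).Trim() -join $delimiter))
}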


The Story of the StringBuilder Saga!

To answer one of @Mark's questions in the comments below: why does the while loop need a .Length in it? Isn't that useless?

while ($sb.Append($stream_in.ReadLine()).Length) 

Well, I thought so too.... until I got stuck in an infinite loop.... but I didn't realize it was an infinite loop. I just thought I had gotten something wrong and 10,000 lines of processing were spewing onto the console. That is why you have:

`[void]$sb.clear()`

What you quickly realize when you start using it is that the StringBuilder class is one noisy little bugger (especially when you are at a PowerShell console). It decides to output its whole life story any chance it gets. That means when you Clear() it:

PS C:\> $sb.Clear()   

Capacity MaxCapacity Length
-------- ----------- ------
      80  2147483647      0

Or Append data to it:

PS C:\Temp\perf> $sb.Append($stream_in.ReadLine())                                                                      

Capacity MaxCapacity Length
-------- ----------- ------
      80  2147483647     80

Like a drunk guy at a bar, -every- -single- -time- StringBuilder wants you to know how big it is, how much bigger it could be, and who it is. And it says it to anyone that will listen. PowerShell is an enthusiastic listener, and will listen to anything anyone says, even if it is witnessing an execution and sees bloody murder happening in front of it (see: the $Error stream; also see: when my boss wanders over, sees a blood-red PowerShell console, and I say to just "ignore all of that stuff... um... it's not important"... and then try to convince him to let me run my script against the Exchange server ;-).

In this case StringBuilder is full of itself, and every time it does something, it returns itself. PowerShell, being the receptive listener, will take the StringBuilder object and proclaim its presence to your loyal PowerShell Console host.

When you are processing a 10,000-line test document and have 2 StringBuilder operations:

while ($sb.Append($stream_in.ReadLine())) {
...
    $sb.Clear()
}

You start spewing out 20,000 lines of messages to the console. For the experienced PowerShell coder, like pushing a magic hotkey, this triggers a quick Ctrl+C, open Vivaldi -> Google -> PowerShell StringBuilder Suppress Output -> I'm Feeling Lucky -> Stack Overflow -> Ctrl+C the first answer: aaand it looks like I'm going to start adding [void] to all of my StringBuilder operations to suppress output from now on:

while ([void]$sb.Append($stream_in.ReadLine())) {
...
    [void]$sb.Clear()
}

Excellent. Let's continue to run tests:

ElapsedMilliseconds: 1

Wow!!! 1 Millisecond! Woot! StringBuilder seems to be quantum level amazing! Let's check the output file:

PS C:\> Get-ChildItem .\out.txt                                                                                

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       2019-12-12   3:44 PM              0 out.txt

0Kb... Well nuts. On a quantum level, we forgot to put the cat into the box.

This is where careful observers will notice the fatal flaw in my Google Stack Overflow Copy Paste skills:

while ([void] ... ) {

Which, while mathematically equivalent to an O(0) operation, is also known as skipping the while loop, also known as not doing anything. Not doing anything sort of doesn't solve the problem.

Well, OK, we'll get rid of the [void] and fall back on our tried and true tools for dealing with undesirable output. First, just redirect the output:

while ($sb.Append($stream_in.ReadLine()) >> SomeWhereElse.txt ) {

0Kb... Or um... Maybe pipe it to Out-Null?

while ($sb.Append($stream_in.ReadLine()) | Out-Null) {

0Kb... grr... I should know they all functionally equate to:

while ( $null ) {

(Breathe in) OK, let's think about this. If this is being run headless, I don't have to worry about a messy console. So let's just let it spew out 10,000 lines of stuff. I know writing to the console makes the execution 10 times longer, but what I'm really curious about right now is the speed.

while ($sb.Append($stream_in.ReadLine())) {
...
    [void]$sb.Clear()
}

Aaand run.

It's running. The console is just spewing out messages about the StringBuilder and I'm chuckling about how clever I was to think of using the StringBuilder class. It's running... hee hee hee... so few people think about the differences between immutable vs. mutable strings... hee hee... huh? ... It's still running.... Even if it ran 10 times slower it should be done by now... OK, Ctrl+C -> Up Arrow -> Enter. Let's run the same thing again and expect a different outcome.

Aaaand... What do you know... the same.

This is when you look at your PowerShell console. You silently know that it has a Screen Buffer Size of 9,999, and it has completely wrapped... probably... 3 times. Far more than it should. And when there is a while loop involved, you know that you messed up on the condition statement, and you have expertly crafted an infinite loop.

Let's go back to the previous example that we based it on:

while ($line = $stream_in.ReadLine()) {
...

We are essentially saying: while $stream_in.ReadLine() has data, loop. Naturally, we do the same with the StringBuilder, only this time we know that we have to append to it:

while ($sb.Append($stream_in.ReadLine())) {
...

I'm doing the same thing... right?

After some repeated swearing and fruitless google-fu, I gave up and plugged in a length check:

while ($sb.Append($stream_in.ReadLine()).Length) {
...

And it worked as expected. But I wasn't happy that I had to do the check. It's not supposed to work that way. It was supposed to work the easy way. I had the night to think about it, and in the morning I finally understood what I had done wrong, and why it wasn't working like that.

Remember earlier when I said that StringBuilder is full of itself? Well, it's full of itself and not 0. The thousands of messages in the console were saying what was wrong:

Capacity MaxCapacity Length
-------- ----------- ------
      80  2147483647      0

Length = 0. The StringBuilder is empty. It was empty 21,000 lines ago.

The while loop's condition wasn't evaluating a string that either has data or is empty/$null; it was being fed a StringBuilder object. A StringBuilder object is not a 0 or a bool; it's a big fat object, which always converts to $true. The condition would therefore evaluate to $true every time, even long after the StreamReader ran out of data.

Adding .Length made sure the StringBuilder handed an actual number to the while loop, ensuring it would stop when it ran out of data.
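A quick illustration (not part of the benchmark):

$sb = [System.Text.StringBuilder]::new()
[bool]$sb          # True  - any non-null object converts to $true, even when empty
[bool]$sb.Length   # False - 0 converts to $false, so the loop can actually end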

HAL9256
  • 12,384
  • 1
  • 34
  • 46
  • Curiosity was getting the better of me to try this regex vs. non-regex comparison, but you beat me to it. +1 So was the intent with `StringBuilder` to avoid a new `String` object for each line? That's clever. If you used `ReadBlock()` like in my answer you could `Append()` the `Char[]`, though trading a `String` for a `Char[]` is still a new `Object`, although doing a fixed-size read with no newline checking still might show gains. I was not sure if Powershell function calls would be slow, and this shows that they are. The alternative I had in mind to avoid that would be an array of (cont.) – Lance U. Matthews Dec 12 '19 at 18:09
  • (cont.) of objects like `[PSCustomObject] @{ Offset = 0; Length = 11; NormalizeIfEmpty = $false }` that would be looped over to read each field of a line; that might be tidier code but can't be faster than effectively unrolling that loop with `Substring()` calls like you've done. The thought I had to eliminate `Trim()` would be to manually walk through the `String`/`Char[]`, testing the `Char` at each index for `[Char]::IsWhiteSpace()`/`-eq [Char] ' '` to find the start of the field. That might be a win if this were C#, but, like I said in another comment, I have my doubts for PowerShell. – Lance U. Matthews Dec 12 '19 at 18:09
  • Please note, the trim requirement can not be removed. :) – Mark Dec 12 '19 at 19:48
  • In the real data, the fields are closer to 100 per line and the data is closer to 500,000 to 9,000,000 rows give or take. So, these things will also change which solution is ultimately the fastest. – Mark Dec 12 '19 at 19:49
  • @Mark We're not looking at eliminating the field trimming, just trimming specifically _with `Trim()`_ as a potential performance optimization. I was going to say, the brilliant/"cheating" answer would be a PowerShell script that uses `Add-Type` or a precompiled assembly to run C# code, although 57 milliseconds of the last solution above would be tough to beat. A straight translation of this code to a C# project wouldn't be hard and would assuredly be faster than PowerShell. This all started as a PowerShell regex question, eh? – Lance U. Matthews Dec 12 '19 at 20:07
  • @HAL9256 Is the $sb.Append($stream_in.ReadLine()).Length to remove blank lines? Or are you feeding the while statement Length because sb doesn't send it something else useful? I'm not familiar with that code pattern. – Mark Dec 12 '19 at 20:09
  • You're right @BACON. I'll post another question if I can't figure it out and decide to attempt the C# challenge. I was researching Add-Type, but as you correctly point out we already have some elegant and very fast choices. – Mark Dec 12 '19 at 20:10
  • FYI - I started with regex because I wanted to be able to plug in any pattern for a file as new file formats come in (without having to hard code new code each time). I was thinking pattern for capture and pattern for additional translation (like blank to zero) with a clean configuration by anyone (when I eventually move on). – Mark Dec 12 '19 at 20:14
  • I was just kidding that this question started as one thing and has evolved into a full-on performance race. The `while ($sb.Append($stream_in.ReadLine()).Length) { }` would be adding that line to the `StringBuilder` (note it gets `Clear()`ed at the end of the loop) and looping as long as lines of non-zero length are read; it's just a way of terminating the loop. I just updated my answer with a `function`-less solution that uses an array/CSV to define the fields; that may be more flexible/readable at the expense of some performance, although not much different than setting lengths in a regex. – Lance U. Matthews Dec 12 '19 at 20:21
  • @HAL9256 Can you explain what $groupsToNormalizeToZero -contains $_.Name does? Certain fields (unnamed, just have the field sequence number, and character position where the field begins and ends (e.g. 5,12 starts at 5, ends at 12, length is 7). – Mark Dec 12 '19 at 21:59
  • Sorry, I don't mean to keep jumping in for @HAL9256, but that comes from my `MatchEvaluator` solution. `$groupsToNormalizeToZero` is an array of regex group names to be set to `'0'` if they are found to be empty; since the regex captures are unnamed, in the matches they end up being the 1-based index as a `String`. `$_` is the current `GroupInfo` object from the `$match.Groups` collection. That test is checking, by `Name`, if the current group's is one that's a candidate for normalization. Perhaps confusingly, `$groupsToNormalizeToZero` is defined shortly _after_ `$match_evaluator`. – Lance U. Matthews Dec 13 '19 at 01:12
  • 1
    No problem @BACON I had to edit my answer to more fully answer exactly *why* I needed `.Length` when using a `StringBuilder` in a `while` loop. That answer morphed into "The Story of the StringBuilder Saga!" section. Enjoy! ;-) – HAL9256 Dec 13 '19 at 01:49
  • If you're not quite done. Here's the thing, the reason for favoring regex is primarily because you can plug in matching patterns. Can you think of way to turn the super-fast string builder solution into something that can be dynamically created once and then used in my loop? That way I could enjoy both the 'plugin-able', configurable clean code aspects and the ridiculous speed. I will post as a new question if desired, but the context is all here. As of now @BACON answer most closely satisfies all the big picture stuff, but you've got the speed analysis, etc. – Mark Dec 13 '19 at 16:51
  • I'm still looking for a pure regex match/replace pattern for replacing the matching fixed length fields with 0 (maybe a look-behind and look-ahead?); however, your options and explanations were the bomb! If you come up with something for my comment above, let me know. I'll post a matching question. I up-voted you but gave solution to BACON. – Mark Dec 13 '19 at 17:43
  • I agree @BACON deserves the answer. He came up with some nice flexible and unique solutions that "really" answer the question. I just took it to the super optimize extreme because I was curious about what programming patterns to follow to optimize for something like this. The only *real* way to optimize further, while providing flexibility, would be to essentially manually re-create the Trim function to use a full `StringBuilder` implementation.... – HAL9256 Dec 13 '19 at 21:21
2

After making some tweaks to your code for demonstration purposes...

  • Truncating the regular expression to match the sample data
  • Changing the output delimiter (now $delimiter) to a , so the results are easy to see
  • Using a StringReader and StringWriter for input and output, respectively

...given...

$text = @'
ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   
'@

...the way you proposed of adjusting the match text at specific indices would look like this...

$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'

$delimiter = ','
$indicesToNormalizeToZero = ,2

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

while ($line = $stream_in.ReadLine()) {
    if ($line -match $match_regex) {
        $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
        foreach ($index in $indicesToNormalizeToZero)
        {
            if ($trimmedMatches[$index] -eq '')
            {
                $trimmedMatches[$index] = '0'
            }
        }

        $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
        $stream_out.WriteLine($new_line)
    }
}

$stream_out.ToString()

An alternative would be to use the [Regex]::Replace() method. This is good for when you need to perform a custom transformation on a match that can't be expressed in a regex substitution. Admittedly, it might be a poor fit here because you're matching an entire line instead of individual fields, so within a match you need to know which field is which.

$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$match_evaluator = {
    param($match)

    # The first element of Groups contains the entire matched text; skip it
    $fields = $match.Groups `
        | Select-Object -Skip 1 `
        | ForEach-Object -Process {
            $field = $_.Value.Trim()
            if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
            {
                $field = '0'
            }

            return $field
        }

    return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}

$delimiter = ','
# Replace with a HashSet/Hashtable for better lookup performance
$groupsToNormalizeToZero = ,'3'

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

while ($line = $stream_in.ReadLine()) {
    $new_line = $match_regex.Replace($line, $match_evaluator)

    # The original input string is returned if there was no match
    if (-not [Object]::ReferenceEquals($line, $new_line)) {
        $stream_out.WriteLine($new_line)
    }
}

$stream_out.ToString()

$match_evaluator is a MatchEvaluator delegate that gets called for each successful match found in the input text to Replace() and returns whatever you want the replacement text to be. Inside I'm doing the same kind of index-specific transformation, comparing the group name (which will be its index as a [String]) to a known list ($groupsToNormalizeToZero); you could use named groups instead, although I found that changes the ordering of $match.Groups. There may be better applications of [Regex]::Replace() here that aren't occurring to me now.

As an alternative to using regex, since their lengths are known you could extract the fields directly from $line using the Substring() method...

$proc_yyyymm = '201912'
$delimiter = ','

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

while ($line = $stream_in.ReadLine()) {
    $id =                $line.Substring( 0, 11).Trim()
    $firstName =         $line.Substring(11, 24).Trim()
    $lastName =          $line.Substring(35, 19).Trim()
    $columnNameTooLong = $line.Substring(54, 17).Trim()
    $fifthColumn =       $line.Substring(71,  9).Trim()

    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.ToString()

Better still, since the length of each line is known you can avoid ReadLine()'s newline checks and subsequent String allocation by reading each line as a block of Chars and extracting the fields from there.

function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}

$proc_yyyymm = '201912'
$delimiter = ','

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id                = ExtractField $lineChars  0 11
    $firstName         = ExtractField $lineChars 11 24
    $lastName          = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn       = ExtractField $lineChars 71  9

    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $stream_out.Write($proc_yyyymm)
    $stream_out.Write($delimiter)
    $stream_out.Write($id)
    $stream_out.Write($delimiter)
    $stream_out.Write($firstName)
    $stream_out.Write($delimiter)
    $stream_out.Write($lastName)
    $stream_out.Write($delimiter)
    $stream_out.Write($columnNameTooLong)
    $stream_out.Write($delimiter)
    $stream_out.WriteLine($fifthColumn)
}

$stream_out.ToString()

Since @HAL9256's answer confirms that PowerShell functions are (very!) slow, a way to do the same thing without redundant code and without functions would be to define a collection of field descriptors and loop over that to extract each field from the appropriate offset...

$proc_yyyymm = '201912'
$delimiter = ','

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

# This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
# The Offset property could be omitted in favor of calculating it in the loop
# based on the Length, however this way A) avoids the extra variable/addition,
# B) allows fields to be ignored if desired, and C) allows fields to be output
# in a different order than the input.
$fieldDescriptors = @(
    @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $stream_out.Write($proc_yyyymm)

    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        # If space is the only possible whitespace character,
        # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
        $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }

        $stream_out.Write($delimiter)
        $stream_out.Write($fieldText)
    }

    $stream_out.WriteLine()
}

$stream_out.ToString()

I am assuming that direct string extraction would be faster than regex, but I don't know that to be $true in general let alone as it pertains to PowerShell; only testing would reveal that.

All of the above solutions yield the following output...

201912,ID,FIRST_NAME,LAST_NAME,COLUMN_NM_TOO_LON,5THCOLUMN
201912,10000000001,MINNIE,MOUSE,COLUMN VALUE LONG,STARTS
201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS
Lance U. Matthews
  • 15,725
  • 6
  • 48
  • 68
  • FYI - My data sample was incorrect. The fields in my actual data either completely fill the field or are left padded, so looks I need to check the last character in the field string for whitespace to decide whether to set to 0. – Mark Dec 12 '19 at 16:07
  • @Mark can you update the Question with the updated data? I have some ideas for performance that would benefit from not needing to Trim the data. – HAL9256 Dec 12 '19 at 16:25
  • @HAL9256 Sorry all. I double-checked and the data is indeed right padded, the same as in the original sample. Let's continue to assume always fixed format always filled or padded with spaces. – Mark Dec 12 '19 at 19:41
  • @BACON What does the $stream_out.ToString() do at the end of your code? – Mark Dec 12 '19 at 20:36
  • That just displays the final output that was built in PowerShell, since I'm using `StringReader`/`StringWriter` instead of `StreamReader`/`StreamWriter`. – Lance U. Matthews Dec 12 '19 at 20:38
  • @BACON I didn't see where the file name/path was being read in and written out? Sorry if that's too simple of a question :) – Mark Dec 12 '19 at 21:40
  • There is no file. The input is read from a `StringReader` wrapped around `$text`, and the output is written to a `StringWriter`. This is noted at the start of my answer. [`StringReader`](https://learn.microsoft.com/dotnet/api/system.io.stringreader) and [`StreamReader`](https://learn.microsoft.com/dotnet/api/system.io.streamreader) provide the same interface because they both inherit from [`TextReader`](https://learn.microsoft.com/dotnet/api/system.io.textreader); it's just the underlying storage is different (a `String` vs. a `Stream`). Same for the `*Writer` classes. – Lance U. Matthews Dec 12 '19 at 21:46
  • I'm still looking for a pure regex match/replace pattern for replacing the matching fixed length fields with 0 (maybe a look-behind and look-ahead?); however, your solution definitely works and is reasonably fast in my tests. Thanks for the assist! – Mark Dec 13 '19 at 17:40
  • I don't think it's possible to transform the empty fields purely with (.NET) regex. The problem is the `0` isn't part of the matched text and isn't _always_ part of the replacement text, so you need a way to apply that conditional logic of "if all spaces were matched, substitute this other text instead." .NET supports [conditional matches](https://docs.microsoft.com/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr) but [not conditional replacements](https://stackoverflow.com/a/46035491/150605), which is where `MatchEvaluator` or custom code comes in. – Lance U. Matthews Dec 13 '19 at 21:01
  • @BACON I was thinking there might be a look behind approach that would still work with a fixed length grab. I might be wrong, I'm OK with regex, but not guru level so. – Mark Jan 14 '20 at 18:26
1

I'm not sure how performant switch is, but if you convert your regex match string to a [regex] object and pass the results through a switch this becomes pretty simple. The first group returned by the regex Matches() method is always the entire matched text, so we can replace the first item in the result with your line header, have a switch check for empty fields, and you'd be all set.

$proc_yyyymm = '201912'
[regex]$match_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

while ($line = $stream_in.ReadLine()) {
    #Pad right to account for truncated lines and get captured values
    $LineMatch=$match_regex.Matches($line.PadRight(160)).groups.value
    #Replace first item in array with line header (first item is entire input line, which we don't want)
    $LineMatch[0] = $proc_yyyymm
    #Replace null columns on 4, 6, and 7 with 0
    switch($true){
        {$LineMatch[4] -notmatch '\S'} {$LineMatch[4]='0'}
        {$LineMatch[6] -notmatch '\S'} {$LineMatch[6]='0'}
        {$LineMatch[7] -notmatch '\S'} {$LineMatch[7]='0'}
    }

    $stream_out.WriteLine(($LineMatch.trim() -join "`t"))

}

If you only want lines that match your regex, and want to ignore shorter lines you can get rid of the .PadRight(160) in that code.

If the switch is slowing things down you could perform a regex replacement as an alternative:

$LineMatch[4] = $LineMatch[4] -replace '^\s*$','0'

That checks whether the string for column 4 is nothing but whitespace and replaces it with 0 if so; otherwise it is left alone.
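If you want to apply that to all three columns without repeating yourself, something like this (untested sketch) would do it:

# Untested: the same replacement applied to columns 4, 6, and 7
foreach ($i in 4,6,7) { $LineMatch[$i] = $LineMatch[$i] -replace '^\s*$','0' }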

TheMadTechnician
  • 34,906
  • 3
  • 42
  • 56
  • I was trying to be brief with my code sample ... but regex is already [regex], the header is a throw away. I edited the above code accordingly, I'll study your answer more thoroughly and test. Thank you!! – Mark Dec 11 '19 at 20:56
  • It's fixed length fields for every row of the file so no need to pad. – Mark Dec 11 '19 at 21:04
0

I tried this but I'm thinking it's going to be too slow ... still testing.

PROGRAM

$proc_yyyymm = '201912'
[regex]$match_data_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{12})'

# deal with header row
if ($has_header_row) {
   $line = $stream_in.ReadLine()
}

while ($line = $stream_in.ReadLine()) {

   if ($line -match $match_data_regex) {

      $Matched = $Matches[1..($Matches.Count-1)].Trim()

      Foreach ($fld in ($file_info.numeric_fields)) {

         if ($Matched[$fld] -eq '') {
            $Matched[$fld] = '0'
         }
      }

      $new_line = ("$proc_yyyymm", "$Matched") -join "`t"
      $stream_out.WriteLine($new_line)
   }
}
Mark
  • 193
  • 11
0

I have a solution there that you can test: if (! $value) { $value = 0 }. There are other solutions on that page too. https://stackoverflow.com/a/57647495/6654942
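Applied to one trimmed field from this question (the offsets are taken from the sample data, so treat them as assumptions), it would look roughly like:

# Rough sketch: normalize one trimmed field using the falsy empty-string check
$value = $line.Substring(35, 19).Trim()
if (! $value) { $value = 0 }   # an empty string is falsy, so it becomes 0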

js2010
  • 23,033
  • 6
  • 64
  • 66