Taking the work that @BACON and @TheMadTechnician provided, I thought I would evaluate the different methods of parsing the file and their relative performance.
Taking the example text:
ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   
Copy/paste it to make a 10,000-line file (keeping each line padded with trailing spaces to a full 80 characters, as the parsers below expect). With that, we can evaluate, on a small scale, the performance of our various implementations.
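For reference, here is one hedged way to build that test file (assuming the three sample lines above are saved as C:\Temp\perf\sample.txt, a path I am making up to match the ones used below):
$sample = Get-Content 'C:\Temp\perf\sample.txt'
# Keep the header once, then repeat the two data rows 5,000 times to get 10,000 data lines
Set-Content -Path 'C:\Temp\perf\input.txt' -Value (@($sample[0]) + ($sample[1..2] * 5000))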
Regular Expressions
@BACON's first example, a straight-up regex plus an empty-string check:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$delimiter = ','
$indicesToNormalizeToZero = ,2
while ($line = $stream_in.ReadLine()) {
if ($line -match $match_regex) {
$trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
foreach ($index in $indicesToNormalizeToZero)
{
if ($trimmedMatches[$index] -eq '')
{
$trimmedMatches[$index] = '0'
}
}
$new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
$stream_out.WriteLine($new_line)
}
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 335
Overall - pretty good performance. We'll use this as our baseline for comparison.
Next up: let's try the sweet MatchEvaluator solution (kudos to @BACON for thinking this one up):
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$match_evaluator = {
param($match)
# The first element of Groups contains the entire matched text; skip it
$fields = $match.Groups `
| Select-Object -Skip 1 `
| ForEach-Object -Process {
$field = $_.Value.Trim()
if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
{
$field = '0'
}
return $field
}
return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}
$delimiter = ','
# Replace with a HashSet/Hashtable for better lookup performance
$groupsToNormalizeToZero = ,'3'
while ($line = $stream_in.ReadLine()) {
$new_line = $match_regex.Replace($line, $match_evaluator)
# The original input string is returned if there was no match
if (-not [Object]::ReferenceEquals($line, $new_line)) {
$stream_out.WriteLine($new_line)
}
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 1921
Match groups! Cool! YAY.... ugh.. but for pure performance... maaaybe not quite there. Likely because we would need to optimize the match evaluator more.
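As the comment in that code hints, one hedged optimization (not re-timed here) would be to swap the array plus -contains lookup for a HashSet:
# Contains() on a HashSet is a constant-time lookup instead of scanning the array for every field
$groupsToNormalizeToZero = [System.Collections.Generic.HashSet[string]]::new()
[void]$groupsToNormalizeToZero.Add('3')
# ...and inside the match evaluator:
#     if ($groupsToNormalizeToZero.Contains($_.Name) -and $field -eq '') { $field = '0' }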
For completeness, let's try @TheMadTechnician's solution:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
while ($line = $stream_in.ReadLine()) {
#Pad right to account for truncated lines and get captured values
$LineMatch=$match_regex.Matches($line.PadRight(160)).groups.value
#Replace first item in array with line header (first item is entire input line, which we don't want)
$LineMatch[0] = $proc_yyyymm
#Replace the empty LAST_NAME column (index 3) with 0
switch($true){
{$LineMatch[3] -notmatch '\S'} {$LineMatch[3]='0'}
}
$stream_out.WriteLine(($LineMatch.trim() -join "`t"))
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 545
Not quite as good as we were expecting. I suspect it's because the -notmatch check adds a second regex evaluation for every line.
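If that is the culprit, a hedged tweak (not re-timed here) would be to replace the regex-based blank check with a plain string test:
switch($true){
    # IsNullOrWhiteSpace() avoids spinning up a second regex for every line
    {[string]::IsNullOrWhiteSpace($LineMatch[3])} {$LineMatch[3]='0'}
}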
Pure PowerShell Way
Let's also try a modified @js2010 method. Its main advantage is that it is a "pure" PowerShell approach, using only PowerShell cmdlets.
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$proc_yyyymm = '201912'
$delimiter = ','
$cols = 0,11,35,54,71,80 # fake extra column at the end, assumes all rows are that wide, padded with spaces
$colsfile = "C:\Temp\perf\input.txt"
$csvfile = "C:\Temp\perf\out.CSV"
$firstline = get-content $colsfile | select -first 1
$headers = for ($i = 0; $i -lt $cols.count - 1; $i++) {
$firstline.substring($cols[$i], $cols[$i+1]-$cols[$i]).trim()
}
Get-Content $colsfile | select -skip 1 | ForEach {
$hash = [ordered]@{}
$hash += @{ "proc_yyyymm" = $proc_yyyymm }
for ($i = 0; $i -lt $headers.length; $i++) {
$value = $_.substring($cols[$i], $cols[$i+1]-$cols[$i]).trim()
if(! $value) { $value = 0 }
$hash += @{$headers[$i] = $value}
}
[pscustomobject]$hash
} | export-csv $csvfile -NoTypeInformation -Delimiter $delimiter
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 1043
This didn't perform badly, and it's actually pretty good for a "native" solution. It has the benefit of being flexible (via hash tables), and @js2010 does something that I love by utilizing Export-Csv: it outputs correctly delimited CSV (or tab-separated) files. The problem with the other solutions above is that the strings aren't wrapped in quotes, so commas or quote characters inside the text aren't escaped. Export-Csv will do that.
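To illustrate why that matters, a quick hedged example (the values are made up, not from the sample file):
# A last name containing the delimiter survives because Export-Csv/ConvertTo-Csv quote the fields
[pscustomobject]@{ ID = '10000000003'; LAST_NAME = 'MOUSE, ESQ.' } | ConvertTo-Csv -NoTypeInformation
# "ID","LAST_NAME"
# "10000000003","MOUSE, ESQ."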
Substring Executions
Since we know we are dealing with fixed length columns, we can use Substrings instead of Regular Expressions. Let's see if there is any performance improvement.
@BACON's next example with a basic substring execution:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
while ($line = $stream_in.ReadLine()) {
$id = $line.Substring( 0, 11).Trim()
$firstName = $line.Substring(11, 24).Trim()
$lastName = $line.Substring(35, 19).Trim()
$columnNameTooLong = $line.Substring(54, 17).Trim()
$fifthColumn = $line.Substring(71, 9).Trim()
if ($lastName -eq '')
{
$lastName = '0'
}
$new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
$stream_out.WriteLine($new_line)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 177
Wow, it looks like substrings run almost twice as fast as our baseline regular expression. This definitely looks like the path to go down.
Let's try @BACON's next example with an ExtractField function and multiple write statements:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
# If the first character of a field is whitespace, assume the
# entire field is as well to avoid a String allocation and Trim()
if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
return '0'
} else {
# Create a String from the span of Chars at known boundaries and trim it
return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
}
}
$proc_yyyymm = '201912'
$delimiter = ','
$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
$id = ExtractField $lineChars 0 11
$firstName = ExtractField $lineChars 11 24
$lastName = ExtractField $lineChars 35 19 $true
$columnNameTooLong = ExtractField $lineChars 54 17
$fifthColumn = ExtractField $lineChars 71 9
# Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
$stream_out.Write($proc_yyyymm)
$stream_out.Write($delimiter)
$stream_out.Write($id)
$stream_out.Write($delimiter)
$stream_out.Write($firstName)
$stream_out.Write($delimiter)
$stream_out.Write($lastName)
$stream_out.Write($delimiter)
$stream_out.Write($columnNameTooLong)
$stream_out.Write($delimiter)
$stream_out.WriteLine($fifthColumn)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 5675
Wow. That runs 17 times slower than the baseline. Is it the function calls or the multiple writes to the file that are causing the slowness? Let's see what happens if I concatenate the output and write it to the file once per line:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
# If the first character of a field is whitespace, assume the
# entire field is as well to avoid a String allocation and Trim()
if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
return '0'
} else {
# Create a String from the span of Chars at known boundaries and trim it
return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
}
}
$proc_yyyymm = '201912'
$delimiter = ','
$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
$id = ExtractField $lineChars 0 11
$firstName = ExtractField $lineChars 11 24
$lastName = ExtractField $lineChars 35 19 $true
$columnNameTooLong = ExtractField $lineChars 54 17
$fifthColumn = ExtractField $lineChars 71 9
# Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
$new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
$stream_out.WriteLine($new_line)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 5654
The slowness is definitely caused by the additional function calls.
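A quick, hedged way to see that per-call overhead in isolation (NoOp is a made-up function; the absolute numbers will vary by machine, it's the relative difference that matters):
function NoOp($x) { $x }
# Roughly 10,000 lines x 5 fields worth of iterations
Measure-Command { foreach ($i in 1..50000) { $null = NoOp $i } }   # with a function call per field
Measure-Command { foreach ($i in 1..50000) { $null = $i } }        # the same work inlined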
EDIT:
Let's check @BACON's latest method: extremely flexible and expandable, with no function calls. Most of the time I would be willing to sacrifice some performance for flexibility:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
# This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
# The Offset property could be omitted in favor of calculating it in the loop
# based on the Length, however this way A) avoids the extra variable/addition,
# B) allows fields to be ignored if desired, and C) allows fields to be output
# in a different order than the input.
$fieldDescriptors = @(
@{ Offset = 0; Length = 11; NormalizeIfEmpty = $false },
@{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
@{ Offset = 35; Length = 19; NormalizeIfEmpty = $true },
@{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
@{ Offset = 71; Length = 9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }
while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
$stream_out.Write($proc_yyyymm)
foreach ($fieldDescriptor in $fieldDescriptors)
{
# If the first character of a field is whitespace, assume the
# entire field is as well to avoid a String allocation and Trim()
# If space is the only possible whitespace character,
# $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
$fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
-and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
) {
'0'
} else {
# Create a String from the span of Chars at known boundaries and trim it
(
New-Object -TypeName 'String' -ArgumentList (
$lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
)
).Trim()
}
$stream_out.Write($delimiter)
$stream_out.Write($fieldText)
}
$stream_out.WriteLine()
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 2645
Wow. Removing the explicit function call saved half the time. It's still slower than the baseline, though. Is that because we create multiple strings, or because we do multiple Write() operations? Instead of multiple Write() operations, let's accumulate the line in a variable. It should be faster because it's all in memory, right?
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
$fieldDescriptors = @(
@{ Offset = 0; Length = 11; NormalizeIfEmpty = $false },
@{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
@{ Offset = 35; Length = 19; NormalizeIfEmpty = $true },
@{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
@{ Offset = 71; Length = 9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }
while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
$new_line = $proc_yyyymm
foreach ($fieldDescriptor in $fieldDescriptors)
{
$new_line += $delimiter
$new_line += if ($fieldDescriptor.NormalizeIfEmpty `
-and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
) {
'0'
} else {
# Create a String from the span of Chars at known boundaries and trim it
(
New-Object -TypeName 'String' -ArgumentList (
$lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
)
).Trim()
}
}
$stream_out.WriteLine($new_line)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 2605
Well, that was unexpected. I think in this case write buffering (by the OS/HDD) is masking what is actually going on. Dealing with larger data sets may change this, but it somewhat confirms our suspicion that the read/write operations have less of an impact than how we handle the many copies of strings in memory.
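If we wanted to take write buffering out of the equation, the StreamWriter's buffer can also be sized explicitly; a hedged option (the defaults are likely fine at this data size):
# Constructor arguments: path, append, encoding, buffer size in bytes
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt", $false, [System.Text.Encoding]::UTF8, 1MB)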
So we want to: 1. avoid function calls (they seem to generate a lot of extra overhead), and 2. use substrings. Are there any other ways to do this?
I suspect that a lot of the overhead comes from allocating/reallocating memory. Basically, every time you manipulate a string, e.g. with .Substring() or .Trim(), or assign it to a variable, you make a copy of the string in memory. What if we made things ugly and mashed everything into one string?
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
while ($line = $stream_in.ReadLine()) {
$lastName = $line.Substring(35, 19).Trim()
if ($lastName -eq '')
{
$lastName = '0'
}
$new_line = $proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71, 9).Trim() -join $delimiter
$stream_out.WriteLine($new_line)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 146
Definitely an improvement. Let's try one more thing:
String Builder
One other class we can investigate is the StringBuilder class. Let's take the last example and use a StringBuilder:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
$sb = [System.Text.StringBuilder]::New(80)
while ($sb.Append($stream_in.ReadLine()).Length) {
$lastName = $sb.ToString(35, 19).Trim()
if ($lastName -eq '')
{
$lastName = '0'
}
$new_line = $proc_yyyymm,$sb.ToString(0, 11).Trim(),$sb.ToString(11, 24).Trim(),$lastName,$sb.ToString(54, 17).Trim(),$sb.ToString(71, 9).Trim() -join $delimiter
$stream_out.WriteLine($new_line)
[void]$sb.clear()
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 161
Overall, no real change in performance. The reason is that in this situation we can't take advantage of some of the benefits StringBuilder provides. Because we need to Trim() the data, which is something StringBuilder can't do natively, we end up converting each field back into a string to perform the trim, which negates any performance improvement.
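For what it's worth, if we ever wanted to avoid Trim() entirely, one hedged idea is to find the non-space boundaries first and allocate each field string exactly once. FieldFromChars below is a hypothetical helper written against the char-buffer (ReadBlock) variant from earlier, and given what we just saw about function-call overhead its body would probably need to be inlined to actually pay off:
function FieldFromChars([char[]]$chars, [int]$start, [int]$length)
{
    # Walk inward from both ends of the field until a non-whitespace character is found
    $first = $start
    $last  = $start + $length - 1
    while ($first -le $last -and [Char]::IsWhiteSpace($chars[$first])) { $first++ }
    while ($last -ge $first -and [Char]::IsWhiteSpace($chars[$last]))  { $last-- }
    if ($last -lt $first) { return '' }   # the whole field was blank
    # Build the string once, already trimmed, straight from the char buffer
    return [String]::new($chars, $first, ($last - $first + 1))
}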
Super Optimize Performance
Now let's get uglier. There is still one more variable assignment we can optimize out. Let's try with the StringBuilder again:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
$sb = [System.Text.StringBuilder]::New(80)
while ($sb.Append($stream_in.ReadLine()).Length) {
$lastName = $sb.ToString(35, 19).Trim()
if ($lastName -eq '')
{
$lastName = '0'
}
$stream_out.WriteLine($proc_yyyymm,$sb.ToString(0, 11).Trim(),$sb.ToString(11, 24).Trim(),$lastName,$sb.ToString(54, 17).Trim(),$sb.ToString(71, 9).Trim() -join $delimiter)
[void]$sb.clear()
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 72
Wow! It seems that even the final variable assignment caused a lot of additional overhead, likely because it allocates yet another copy of the string in memory.
And finally, let's do a pure String.Substring
implementation, with the same optimization:
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
while ($line = $stream_in.ReadLine()) {
$lastName = $line.Substring(35, 19).Trim()
if ($lastName -eq '')
{
$lastName = '0'
}
$stream_out.WriteLine($proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71, 9).Trim() -join $delimiter)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch
ElapsedMilliseconds: 57
This is the current best option: String.Substring with as little variable assignment as possible, and it runs almost 6 times faster than the baseline regex. Now, from @Mark's comments: if we could remove the Trim() requirement, we could try StringBuilder again... Also, we might see improvements if we could read the whole file into memory (e.g. StringBuilder the whole file); that would be interesting...
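On the "read the whole file into memory" idea, here is a hedged sketch using [System.IO.File]::ReadAllLines() (not timed here; it trades memory for fewer read calls, so it only makes sense while the file fits comfortably in RAM):
$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$delimiter = ','
# Pull every line into a string array in a single call, then parse exactly as before
foreach ($line in [System.IO.File]::ReadAllLines("C:\Temp\perf\input.txt")) {
    $lastName = $line.Substring(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }
    $stream_out.WriteLine($proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71, 9).Trim() -join $delimiter)
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch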
The Story of the StringBuilder Saga!
To answer one of @Mark's questions in the comments below: why does the while loop need a .Length in it? Isn't that useless?
while ($sb.Append($stream_in.ReadLine()).Length)
Well, I thought so too.... until I got stuck in an infinite loop.... but I didn't realize it was an infinite loop. I just thought I had gotten something wrong and 10,000 lines of processing were spewing onto the console. That is why you have:
`[void]$sb.clear()`
What you quickly realize when you start using it is that the StringBuilder class is one noisy little bugger (especially when you are in a PowerShell console). It decides to output its whole life story any chance it gets. That means when you Clear() it:
PS C:\> $sb.Clear()
Capacity MaxCapacity Length
-------- ----------- ------
80 2147483647 0
Or Append data to it:
PS C:\Temp\perf> $sb.Append($stream_in.ReadLine())
Capacity MaxCapacity Length
-------- ----------- ------
80 2147483647 80
Like a drunk guy at a bar, -Every- -single- -time- StringBuilder wants you to know how big it is, how much bigger it could be, and who it is. And it says it to anyone that will listen. PowerShell is an enthusiastic listener, and will listen to anything anyone says, even if it is observing an execution and sees bloody murder happening in front of it (see: the $Error stream; also see: when my boss wanders over, sees a blood-red PowerShell console, and I say to just "ignore all of that stuff... um... it's not important"... then try to convince him to let me run my script against the Exchange server ;-).
In this case StringBuilder is full of itself, and every time it does something, it returns itself. PowerShell, being the receptive listener, will take the StringBuilder object and proclaim its presence to your loyal PowerShell console host.
When you are processing a 10,000-line test document and have 2 StringBuilder operations:
while ($sb.Append($stream_in.ReadLine())) {
...
$sb.Clear()
}
You start spewing out 20,000 lines of messages to the console. For the experienced PowerShell coder this triggers, like pushing a magic hotkey, a quick Ctrl+C, open Vivaldi -> Google -> "PowerShell StringBuilder suppress output" -> I'm Feeling Lucky -> Stack Overflow -> Ctrl+C the first answer. Aaand it looks like I'm going to start adding [void] to all of my StringBuilder operations to suppress the output from now on:
while ([void]$sb.Append($stream_in.ReadLine())) {
...
[void]$sb.Clear()
}
Excellent. Let's continue to run tests:
ElapsedMilliseconds: 1
Wow!!! 1 millisecond! Woot! StringBuilder seems to be quantum-level amazing! Let's check the output file:
PS C:\> Get-ChildItem .\out.txt
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2019-12-12 3:44 PM 0 out.txt
0Kb
... Well nuts. On a quantum level, we forgot to put the cat into the box.
This is where careful observers will notice the fatal flaw in my Google Stack Overflow Copy Paste skills:
while ([void] ... ) {
Which, while mathematically equivalent to an O(0) operation, is also known as skipping the while loop entirely, also known as not doing anything. Not doing anything sort of doesn't solve the problem.
Well, OK, we'll get rid of the [void] and fall back on our tried-and-true tools for dealing with undesirable output. First, just redirect the output:
while ($sb.Append($stream_in.ReadLine()) >> SomeWhereElse.txt ) {
0Kb
... Or um... maybe pipe it to Out-Null?
while ($sb.Append($stream_in.ReadLine()) | Out-Null) {
0Kb
... grr... I should have known; they all functionally equate to:
while ( $null ) {
(Breathe in) OK, let's think about this. If this is being run headless, I don't have to worry about a messy console. So let's just let it spew out 10,000 lines of stuff. I know writing to the console makes the execution 10 times longer, but what I'm really curious about right now is the speed.
while ($sb.Append($stream_in.ReadLine())) {
...
[void]$sb.Clear()
}
Aaand run.
It's running. The console is just spewing out messages about the StringBuilder, and I'm chuckling about how clever I was to think of using the StringBuilder class. It's running... hee hee hee... so few people think about the differences between immutable vs. mutable strings... hee hee... huh? ... It's still running.... Even if it ran 10 times slower it should be done by now... OK: Ctrl+C -> Up Arrow -> Enter. Let's run the same thing again, and expect a different outcome.
Aaaand... What do you know... the same.
This is when you look at your PowerShell console. You silently know that it has a screen buffer size of 9,999, and it has completely wrapped... probably... 3 times. Far more than it should. And when there is a while loop involved, you know that you messed up on the condition statement, and you have expertly crafted an infinite loop.
Let's go back to the previous example that this was based on:
while ($line = $stream_in.ReadLine()) {
...
We are essentially saying: while $stream_in.ReadLine() returns data, loop. Naturally, we do the same with the StringBuilder, only this time we know that we have to append the line to it:
while ($sb.Append($stream_in.ReadLine())) {
...
I'm doing the same thing... right?
After some repeated swearing and fruitless google-fu, I gave up and plugged in a length check:
while ($sb.Append($stream_in.ReadLine()).Length) {
...
And it worked as expected. But I wasn't happy that I had to do the check. It's not supposed to work that way. It was supposed to work the easy way. I had the night to think about it, and in the morning I finally understood what I had done wrong, and why it wasn't working like that.
Remember earlier when I said that StringBuilder is full of itself? Well, it's full of itself and not 0. The thousands of messages in the console were saying what was wrong:
Capacity MaxCapacity Length
-------- ----------- ------
80 2147483647 0
Length = 0
The StringBuilder is empty. It was empty 21,000 lines ago.
The while loop condition wasn't evaluating a String that either has data or is 0/$null; it was being fed a StringBuilder object. The StringBuilder object is not a 0 or a bool, it's a big fat object, and a non-null object always converts to $true. So the while loop condition evaluated to $true every time, even long after the StreamReader ran out of data.
Adding in a .Length made sure that the StringBuilder handed an actual number to the while loop condition, ensuring it would stop once it ran out of data.
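A quick demonstration of the difference (hedged; the comments describe what I would expect a PowerShell 5+ console to print):
$sb = [System.Text.StringBuilder]::new()
[bool]$sb            # True  - a non-null object always converts to $true
[bool]$sb.Length     # False - Length is 0, and 0 converts to $false
[void]$sb.Append('data')
[bool]$sb.Length     # True  - now there is actually something to process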