I need to obtain an installer that is base64-encoded and embedded in a JSON REST response. Because the JSON string is large (180 MB), decoding the response with standard PowerShell tooling often throws an OutOfMemoryException in limited-memory scenarios (such as hitting WinRM memory quotas).

It's not desirable to raise the memory quota in our environment for a single installation, and we don't have standard tooling to prepare a package whose payload isn't available at a simple HTTP endpoint (I don't have permission to publish packages outside of our build system). My solution in this case is to decode the base64 string in chunks. I have this working, but I am stuck on one last bit of optimization.


Currently I am using a MemoryStream to read from the string, but its constructor requires a byte[]:

# $Base64String is a [ref] type
$memStream = [IO.MemoryStream]::new([Text.Encoding]::UTF8.GetBytes($Base64String.Value))

Unsurprisingly, this copies the entire base64-encoded string into a byte[], which makes it even less memory-efficient than the built-in tooling in its current form. The code not shown here reads from $memStream 1024 bytes at a time, decodes each chunk of base64, and writes the resulting bytes to disk using BinaryWriter. This all works, albeit slowly, since I'm forcing garbage collection fairly often. However, I want to extend this byte counting to the initial MemoryStream and only read n bytes from the string at a time. My understanding is that base64 strings must be decoded in chunks whose size is divisible by 4.

The problem is that [string].Substring([int], [int]) counts characters, not bytes. The JSON response can be assumed to be UTF-8 encoded, but even with this assumption, UTF-8 characters vary from 1 to 4 bytes in length. How can I (directly or indirectly) take a substring of a specific number of bytes in PowerShell, so I can create the MemoryStream from this substring instead of the full $Base64String?
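
To illustrate the mismatch between character count and byte count, here is a small sketch. The sample characters are arbitrary choices for illustration (not taken from the real payload) and are built from code points so the script file's own encoding does not matter.

```powershell
# Sample strings built from code points (arbitrary examples for illustration only)
$samples = [ordered]@{
  'U+0041'  = [string][char]0x0041               # Latin capital A
  'U+00E9'  = [string][char]0x00E9               # e with acute accent
  'U+20AC'  = [string][char]0x20AC               # euro sign
  'U+1F600' = [char]::ConvertFromUtf32(0x1F600)  # emoji, stored as a surrogate pair
}

# Compare the .NET character count with the UTF-8 byte count of each sample
$byteCounts = foreach ($name in $samples.Keys) {
  $text = $samples[$name]
  [pscustomobject]@{
    CodePoint = $name
    CharCount = $text.Length
    ByteCount = [Text.Encoding]::UTF8.GetByteCount($text)
  }
}

$byteCounts | Format-Table
```

The last sample also shows that a single code point can occupy two [char] values in a .NET string, so Length is not a code point count either.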

I have also explored the [Text.Encoding].GetBytes([string], [int], [int]) overload, but it has the same issue: the method expects a character count, not a byte count, for the portion of the string to convert from the starting index.

codewario
  • [This](https://stackoverflow.com/questions/2525533/is-there-a-base64stream-for-net) might be helpful (`CryptoStream` and `FromBase64Transform`). As this exposes a stream as-is, you can probably then just use `Stream.CopyTo` instead of writing any code of your own. – Jeroen Mostert Jun 13 '22 at 16:05
  • @JeroenMostert Unless I'm missing something, I would still need to prepare the base64 string as some form of `Stream` to create the `CryptoStream` object, wouldn't I? If that's the case I still have the same problem presented in selecting only `n` bytes of the string for the stream. – codewario Jun 13 '22 at 16:50
  • Don't you get the JSON REST response as a `Stream`? If not, you should certainly be able to arrange it as such. – Jeroen Mostert Jun 13 '22 at 17:47
  • @JeroenMostert I'm using `Invoke-WebRequest` right now, but you are right I could tackle this using `WebClient` and reading the response into a `Stream`. I did find a solution to my question as presented (I am working on a function which I'll share when done) but I will probably wind up writing my own web request function and consume the stream directly since it still doesn't completely solve the memory issues I'm facing. – codewario Jun 13 '22 at 18:29
  • Yes, if you are under such restrictions that even processing 180 MB is too much, it sounds like you want a solution that fully streams from front to back and explicitly never stores intermediary data, so you only spend memory on the buffers (which tend to have configurable size). Solutions where you have to manually chunk input are never fun, regardless of technology. – Jeroen Mostert Jun 13 '22 at 18:34
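
The streaming approach suggested in these comments (CryptoStream with FromBase64Transform) could be sketched roughly as follows. This is a minimal illustration, not the asker's code: the small in-memory payload stands in for the real response stream, and it assumes the input stream contains only the raw base64 text (in the real scenario the payload sits inside a JSON document, which this sketch does not handle).

```powershell
# Stand-in for the response body: a small payload encoded as base64 (illustrative only)
$payload     = [byte[]](0..255)
$base64Text  = [Convert]::ToBase64String($payload)
$inputStream = [System.IO.MemoryStream]::new([System.Text.Encoding]::ASCII.GetBytes($base64Text))

# A FileStream would be used in practice; a MemoryStream keeps the sketch self-contained
$outputStream = [System.IO.MemoryStream]::new()

# Wrap the base64 text stream so that reading from it yields decoded bytes
$transform    = [System.Security.Cryptography.FromBase64Transform]::new()
$mode         = [System.Security.Cryptography.CryptoStreamMode]::Read
$cryptoStream = [System.Security.Cryptography.CryptoStream]::new($inputStream, $transform, $mode)

try {
  # CopyTo pulls the data through the transform one buffer at a time
  $cryptoStream.CopyTo($outputStream)
} finally {
  $cryptoStream.Dispose()   # also closes the wrapped input stream
  $transform.Dispose()
}

$decoded = $outputStream.ToArray()
```

With a real response stream as the input and a FileStream as the output, memory use would be limited mostly to the copy buffers, which is the point made in the comments.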

1 Answer

To answer the base question "How can I substring a specific number of bytes from a string in PowerShell", I was able to write the following function:

function Get-SubstringByByteCount {
  [CmdletBinding()]
  Param(
    [Parameter(Mandatory)]
    [ValidateScript({ $null -ne $_ -and $_.Value -is [string] })]
    [ref]$InputString,
    [int]$FromIndex = 0,
    [Parameter(Mandatory)]
    [int]$ByteCount,
    [ValidateScript({ [Text.Encoding]::$_ })]
    [string]$Encoding = 'UTF8'
  )
  
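  [long]$byteCounter = 0
  [int]$i = $FromIndex  # start reading at the requested character index
  [System.Text.StringBuilder]$sb = New-Object System.Text.StringBuilder $ByteCount

  try {
    # Append one character at a time until the byte budget is reached.
    # A multi-byte final character can push the total past $ByteCount.
    while ( $byteCounter -lt $ByteCount -and $i -lt $InputString.Value.Length ) {
      [char]$char = $InputString.Value[$i++]
      [void]$sb.Append($char)
      $byteCounter += [Text.Encoding]::$Encoding.GetByteCount($char)
    }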

    $sb.ToString()
  } finally {
    if( $sb ) {
      $sb = $null
      [System.GC]::Collect()
    }
  }
}

Invocation works like so:

Get-SubstringByByteCount -InputString ( [ref]$someString ) -ByteCount 8

Some notes on this implementation:

  • Takes the string as a [ref] type since the original goal was to avoid copying the full string in a limited-memory scenario. This function could be re-implemented using the [string] type instead.
  • This function essentially adds each character to a StringBuilder until the specified number of bytes has been written.
  • The number of bytes of each character is determined by calling one of the GetByteCount overloads on the selected [Text.Encoding] instance. The encoding can be specified via a parameter, but its value must match the name of one of the static encoding properties available on [Text.Encoding]. Defaults to UTF8 as written.
  • $sb = $null and [System.GC]::Collect() are intended to forcibly clean up the StringBuilder in a memory-constrained environment, but could be omitted if this is not a concern.
  • -FromIndex takes the start position within -InputString to begin the substring operation from. Defaults to 0 to evaluate from the start of the -InputString.
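
For the base64 use case specifically, a simpler route may be available. The base64 alphabet (A-Z, a-z, 0-9, +, / and the = padding) is entirely ASCII, and every ASCII character encodes to exactly one byte in UTF-8, so character count and byte count coincide for such a string. Under that assumption, a plain Substring in multiples of 4 characters may be enough. The following is a minimal sketch of that idea rather than a use of the function above; the sample payload and $chunkChars are illustrative, and it assumes the string contains no whitespace or line breaks.

```powershell
# Illustrative payload; in practice this would be the large base64 string from the response
$payload      = [byte[]](0..99)
$base64String = [Convert]::ToBase64String($payload)

# Chunk size in characters; a multiple of 4 so that every chunk decodes on its own
$chunkChars = 16

# A FileStream would be used for a real installer; a MemoryStream keeps this self-contained
$outputStream = [System.IO.MemoryStream]::new()
try {
  for ($offset = 0; $offset -lt $base64String.Length; $offset += $chunkChars) {
    # The final chunk may be shorter, but is still a multiple of 4 including padding
    $length = [Math]::Min($chunkChars, $base64String.Length - $offset)
    $bytes  = [Convert]::FromBase64String($base64String.Substring($offset, $length))
    $outputStream.Write($bytes, 0, $bytes.Length)
  }
  $decoded = $outputStream.ToArray()
} finally {
  $outputStream.Dispose()
}
```

Each iteration only allocates a short substring and its decoded bytes, so a larger $chunkChars trades memory for fewer allocations.
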
codewario