
The script should copy files and compute their hash sums. My goal is to write a function that reads each file once instead of three times (read_for_copy + read_for_hash + read_for_another_copy) to minimize network load. So I tried to read a chunk of the file, compute the MD5 hash sum, and write the chunk out to several places. The file's size may vary from 100 MB up to 2 TB and maybe more. There is no need to check file identity at this moment; I just need to compute the hash sum of the initial files.

And I am stuck on computing the hash sum:

    $ifile = "C:\Users\User\Desktop\inputfile"
    $ofile = "C:\Users\User\Desktop\outputfile_1"
    $ofile2 = "C:\Users\User\Desktop\outputfile_2"
    
    $md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
    $bufferSize = 10mb
    $stream = [System.IO.File]::OpenRead($ifile)
    $makenew = [System.IO.File]::OpenWrite($ofile)
    $makenew2 = [System.IO.File]::OpenWrite($ofile2)
    $buffer = new-object Byte[] $bufferSize
    
    while ( $stream.Position -lt $stream.Length ) {
    
        $bytesRead = $stream.Read($buffer, 0, $bufferSize)
        $makenew.Write($buffer, 0, $bytesRead)
        $makenew2.Write($buffer, 0, $bytesRead)
    
        # I am stuck here
        $hash = [System.BitConverter]::ToString($md5.ComputeHash($buffer)) -replace "-",""
    }
    
    $stream.Close()
    $makenew.Close()
    $makenew2.Close()

How can I collect the chunks of data to compute the hash of the whole file?

And an extra question: is it possible to calculate the hash and write the data out in parallel, especially taking into account that `workflow { parallel { } }` is not supported from PowerShell version 6 onwards?

Many thanks


2 Answers


If you want to handle input buffering manually, you need to use the TransformBlock/TransformFinalBlock methods exposed by $md5:

while($bytesRead = $stream.Read($buffer, 0, $bufferSize))
{
    # Write to file copies
    $makenew.Write($buffer, 0, $bytesread) 
    $makenew2.Write($buffer, 0, $bytesread)

    # Feed next chunk to MD5 CSP
    $null = $md5.TransformBlock($buffer, 0 , $bytesRead, $null, 0)
}

# Complete the hashing routine
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)

# Grab hash value from CSP
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')

"My goal is to write a function that reads each file once instead of three times (read_for_copy + read_for_hash + read_for_another_copy) to minimize network load"

I'm not entirely sure what you mean by network load here. If the source file is on a remote file share but the new copies go onto a local file system, you can minimize network load by simply copying the source file once, then using that one copy as the source for both the second copy and the hash calculation:

$ifile = "\\remoteMachine\c$\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
    
# Copy remote -> local
Copy-Item -Path $ifile -Destination $ofile
# Copy local -> local
Copy-Item -Path $ofile -Destination $ofile2

# Hash local file stream
$md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$stream = [System.IO.File]::OpenRead($ofile)
$hash = [BitConverter]::ToString($md5.ComputeHash($stream)).Replace('-','')

FWIW, passing the file stream object to `$md5.ComputeHash($stream)` directly is likely going to be faster than manually buffering the input.
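
If you don't need the manual stream handling at all, the built-in `Get-FileHash` cmdlet (PowerShell 4+) produces the same digest; a minimal equivalent for the local copy created above:

# Same digest as above, computed by the built-in cmdlet
$hash = (Get-FileHash -Path $ofile -Algorithm MD5).Hash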

Mathias R. Jessen
  • Yeah, it works. Just a small typo: you wrote `$hash = [BitConverter]::ToString($md5).Replace('-','')` and it should be `$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')` – kostyan Aug 28 '20 at 00:24
  • You are right, the simplest way to do that is using `Copy-Item` and `Get-FileHash`. But then I have 3 iterations of reading for each file, and both paths are on network storage, so in the case of 2 TB files reading the same data three times in a row is not the best way. I tried to read the data only once in chunks and then feed those chunks to the MD5 and copy functions. I guess that should be a bit faster than reading the same data again for each operation. So now I would like to try parallel hash computing and write-out functions... – kostyan Aug 28 '20 at 00:46

Final listing

$ifile = "C:\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"

$md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$bufferSize = 1mb
$stream = [System.IO.File]::OpenRead($ifile)
$makenew = [System.IO.File]::OpenWrite($ofile)
$makenew2 = [System.IO.File]::OpenWrite($ofile2)
$buffer = new-object Byte[] $bufferSize

while ( $stream.Position -lt $stream.Length )
{
    $bytesRead = $stream.Read($buffer, 0, $bufferSize)

    # Write the chunk to both output files
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)

    # Feed the same chunk to the MD5 provider
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)
}

# Finalize the hash and read out the digest
$null = $md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
$hash

$stream.Close()
$makenew.Flush()
$makenew.Close()
$makenew2.Flush()
$makenew2.Close()
kostyan
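
As for the extra question about calculating the hash and writing the data out in parallel: `workflow { parallel { } }` is indeed gone in PowerShell 6+, but one possible way to overlap the work is to start both writes with the .NET `WriteAsync` method and hash the chunk on the current thread while they run. The following is a minimal sketch only, reusing the `$ifile`/`$ofile`/`$ofile2` paths and buffer size from the listings above; note that `[System.IO.File]::OpenWrite` does not open the streams for asynchronous I/O, so the async writes are effectively handed off to thread-pool threads, and whether this helps at all depends on whether the bottleneck is the network or the CPU.

# Cross-platform way to get an MD5 instance
$md5        = [System.Security.Cryptography.MD5]::Create()
$bufferSize = 1mb
$stream     = [System.IO.File]::OpenRead($ifile)
$makenew    = [System.IO.File]::OpenWrite($ofile)
$makenew2   = [System.IO.File]::OpenWrite($ofile2)
$buffer     = [byte[]]::new($bufferSize)

while ($bytesRead = $stream.Read($buffer, 0, $bufferSize))
{
    # Start both writes of the current chunk without waiting for them
    $write1 = $makenew.WriteAsync($buffer, 0, $bytesRead)
    $write2 = $makenew2.WriteAsync($buffer, 0, $bytesRead)

    # Hash the same chunk on this thread while the writes are in flight
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)

    # The buffer is reused on the next read, so both writes must finish first
    [System.Threading.Tasks.Task]::WaitAll($write1, $write2)
}

# Finalize and read out the digest
$null = $md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
$hash

$stream.Close()
$makenew.Close()
$makenew2.Close()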