I am having a hard time with PowerShell (I am learning it on the fly). I have a huge amount of data and I am trying to find a unique identifier for every folder of data. I wrote a script that computes an MD5 hash over every folder recursively and saves the hash value for each folder. As you might have guessed, it is very slow. So I thought I would hash only the metadata, but I have no idea how to do this in PowerShell. The ideas I found on the internet do not work: they always return the same hash value. Has anyone had a similar problem? Is there a PowerShell trick for this task?

Sorry for lack of precision.

I have a big list of about 20,000 folders. Every folder contains unique data: photos, files, etc. I iterated through every folder and computed the hash of every file (I actually used a CryptoStream here, so I ended up with one hash for all the data). This solution takes ages.

The solution I wanted to adopt was to use the metadata, like the properties returned by this command:

Get-ChildItem -Path $Env:USERPROFILE\Desktop -Force | Select-Object -First 1 |  Format-List *

But hashing this always gives me the same value, even when something has changed. I need to be able to check whether anything has changed in those files.

  • Edit the question and explain in more detail what you are trying to achieve. A hash might (not) be the best tool anyway. – vonPryz Aug 15 '22 at 13:59
  • What exactly do you mean when you say "the metadata"? The file name and size? – Mathias R. Jessen Aug 15 '22 at 14:05
  • Does this [answer](https://stackoverflow.com/a/10521162/4190564) help? It gives an example of getting an MD5 of a string. Also, I agree with vonPryz: there are some things you say that I don't understand. For example, "saving the hash value for every folder": normally Get-FileHash gets hash values for files, not folders. – Darin Aug 15 '22 at 14:08
  • Another thought on this: perhaps you need to define the overall goal. What exactly are you trying to do? Do you want some way to identify the exact same folder on two different hard drives? Is that why you don't want to just use the folder's path and name as an identifier? Yet another thought: asynchronous code is normally faster than synchronous code. You may need to create a function that gets the unique ID/signature of a folder and then call it so that it runs in the background while you do other work. – Darin Aug 15 '22 at 14:31
  • Thanks for the comments. I edited the question. Get-FileHash is too slow in this scenario; I have to go faster. Yes, I want to create a signature for a folder, but how could I achieve this? – enter_username_1 Aug 15 '22 at 15:19
  • As defined in the PowerShell help files, Get-FileHash is for files only: [Computes the hash value for a file by using a specified hash algorithm.](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/get-filehash?view=powershell-7.2) So the cmdlet has to read every file in all the folders you target, which is why it takes time. The more files you have, the longer it takes, unless you use a parallel/job process. You have to write your own code for the folder use case, using .NET or a third-party tool. – postanote Aug 15 '22 at 22:07
  • Metadata is data about an object, not really a target for hashing, as the object as a whole is. – postanote Aug 15 '22 at 22:52

2 Answers

Continuing from my comment.

As per this resource

Third-party tool: http://www.idrix.fr/Root/Samples/DirHash.zip

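function Get-FolderHash ($folder)
{
    # Start with an empty array so an empty folder still hashes instead of failing on $null.
    [Byte[]]$contents = @()

    # Append the bytes of every file under $folder (this loads all the data into memory).
    Get-ChildItem $folder -Recurse | Where-Object { !$_.PSIsContainer } |
    ForEach-Object { $contents += [System.IO.File]::ReadAllBytes($_.FullName) }

    $hasher = [System.Security.Cryptography.SHA1]::Create()

    # Format the SHA-1 digest as a lowercase hex string.
    [string]::Join("", $($hasher.ComputeHash($contents) |
    ForEach-Object { "{0:x2}" -f $_ }))
}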

Note that I've not tested/validated either of the above and will leave that to you.
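The Get-FolderHash function above concatenates every file into one byte array before hashing, which can exhaust memory on large folders. A minimal sketch of a streaming variant follows. The function name `Get-FolderContentHash`, the 1 MB buffer, the choice of SHA-256, and sorting by FullName to get a stable order are all my assumptions, not part of the original answer. It still has to read every byte, so it will not be faster than hashing file by file; it only keeps memory use flat.

```powershell
# Hypothetical helper: hashes the contents of all files under a folder in one pass,
# feeding the hasher block by block instead of building one giant byte array.
function Get-FolderContentHash {
    param([Parameter(Mandatory)][string]$Path)

    $hasher = [System.Security.Cryptography.SHA256]::Create()
    $buffer = [byte[]]::new(1MB)
    try {
        # Sort so that the same set of files is always fed in the same order.
        $files = Get-ChildItem -LiteralPath $Path -Recurse -File -Force | Sort-Object FullName
        foreach ($file in $files) {
            $stream = [System.IO.File]::OpenRead($file.FullName)
            try {
                while (($read = $stream.Read($buffer, 0, $buffer.Length)) -gt 0) {
                    # Passing $null as the output buffer means "hash only, do not copy".
                    [void]$hasher.TransformBlock($buffer, 0, $read, $null, 0)
                }
            }
            finally { $stream.Dispose() }
        }
        # Finish the hash with an empty final block, then read the digest.
        [void]$hasher.TransformFinalBlock([byte[]]::new(0), 0, 0)
        [System.BitConverter]::ToString($hasher.Hash) -replace '-', ''
    }
    finally { $hasher.Dispose() }
}
```

Usage would look like `Get-FolderContentHash -Path C:\Temp`. Because only file contents are fed to the hasher, renaming a file without changing the sort order leaves the result unchanged.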

Lastly, this is not the first time this kind of question has been asked on SO, and it has been answered using the default cmdlet and some .NET. So, this could be seen/marked as a duplicate.

$HashString = (Get-ChildItem C:\Temp -Recurse -File | 
Get-FileHash -Algorithm MD5).Hash | 
Out-String
Get-FileHash -InputStream ([IO.MemoryStream]::new([char[]]$HashString))

Original, faster but less robust, method:

$HashString = Get-ChildItem C:\script\test\TestFolders -Recurse | Out-String
Get-FileHash -InputStream ([IO.MemoryStream]::new([char[]]$HashString))

This could be condensed into one line if wanted, although it gets harder to read:

Get-FileHash -InputStream ([IO.MemoryStream]::new([char[]]"$(Get-ChildItem C:\script\test\TestFolders -Recurse|Out-String)"))

Whether it is fast enough for your use case is a different matter, but it does ensure that the hash changes when the listed contents of the target folder change.
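The snippets above cast the text to [char[]] before handing it to MemoryStream, which relies on PowerShell converting each character to a single byte. That is only safe for ASCII text, and as far as I know file names containing other characters may fail to convert. A small sketch that encodes the string explicitly as UTF-8 instead; `Get-StringHash` is a hypothetical helper name and the SHA256 default is my choice:

```powershell
# Hypothetical helper: hash an arbitrary string via Get-FileHash -InputStream,
# using an explicit UTF-8 encoding instead of a [char[]] cast.
function Get-StringHash {
    param(
        [Parameter(Mandatory)][AllowEmptyString()][string]$Text,
        [string]$Algorithm = 'SHA256'
    )

    $bytes  = [System.Text.Encoding]::UTF8.GetBytes($Text)
    $stream = [System.IO.MemoryStream]::new($bytes)
    try {
        (Get-FileHash -InputStream $stream -Algorithm $Algorithm).Hash
    }
    finally { $stream.Dispose() }
}
```

With that in place, the listing-based approach becomes `Get-StringHash -Text (Get-ChildItem C:\script\test\TestFolders -Recurse | Out-String)`. Keep in mind that Out-String output depends on the default table formatting, so the result is only comparable between runs on the same machine and PowerShell version.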

postanote

First, create an MD5 class that does not create a new instance of System.Security.Cryptography.MD5 every time we create an MD5 from a string.

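class MD5 {
    # One shared hasher, reused for every call (not thread-safe).
    static hidden [System.Security.Cryptography.MD5]$_md5 = [System.Security.Cryptography.MD5]::Create()
    static [string]Create([string]$inputString) {
        # UTF8 keeps non-ASCII characters distinct; ASCII would turn each of them into '?'.
        return [BitConverter]::ToString([MD5]::_md5.ComputeHash([Text.Encoding]::UTF8.GetBytes($inputString)))
    }
}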

Second, use each child item's Name, Length, CreationTimeUtc, and LastWriteTimeUtc to build ID text for each child in the folder, merge the text into a single string, and create an MD5 from the resulting string.

  1. Get the child objects of a folder.
  2. Select only certain properties, returning the content as a string array.
  3. Join the string array into a single string. No need for joining with newline.
  4. Convert the string into an MD5.
  5. Output the newly created MD5.

$ChildItems = Get-ChildItem -Path $Env:USERPROFILE\Desktop -Force
$SelectProperties = [string[]]($ChildItems | Select-Object -Property Name, Length, CreationTimeUtc, LastWriteTimeUtc)
$JoinedText = $SelectProperties -join ''
$MD5 = [MD5]::Create($JoinedText)
$MD5

Alternately, join the above lines into a very long command.

$AltMD5 = [MD5]::Create([string[]](Get-ChildItem -Path $Env:USERPROFILE\Desktop -Force | Select-Object -Property Name, Length, CreationTimeUtc, LastWriteTimeUtc) -join '')
$AltMD5

The resulting MD5 is a signature of the folder's contents, not of the folder itself. So, in theory, you could rename the folder and the MD5 would stay the same.
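To make that concrete for nested folders, here is a sketch of a recursive variant that records each file's path relative to the folder root, so the signature does not depend on where the folder lives or what it is called. The function name `Get-FolderMetadataSignature`, the choice of properties (relative path, Length, LastWriteTimeUtc ticks), and SHA-256 are my assumptions. It looks only at files, so empty sub-folders are ignored.

```powershell
# Hypothetical helper: builds a signature from file metadata only (no file contents are read).
function Get-FolderMetadataSignature {
    param([Parameter(Mandatory)][string]$Path)

    # Normalize the root so relative paths can be cut out of FullName.
    $root = (Resolve-Path -LiteralPath $Path).ProviderPath.TrimEnd([char[]]'\/')

    $lines = Get-ChildItem -LiteralPath $root -Recurse -File -Force |
        ForEach-Object {
            $relative = $_.FullName.Substring($root.Length)
            '{0}|{1}|{2}' -f $relative, $_.Length, $_.LastWriteTimeUtc.Ticks
        } |
        Sort-Object   # stable order, independent of enumeration order

    $bytes = [System.Text.Encoding]::UTF8.GetBytes(($lines -join "`n"))
    $sha   = [System.Security.Cryptography.SHA256]::Create()
    try {
        [System.BitConverter]::ToString($sha.ComputeHash($bytes)) -replace '-', ''
    }
    finally { $sha.Dispose() }
}
```

Since no file is opened, the cost is roughly that of a directory listing. The trade-off is that an edit which keeps both the size and the last-write time of a file goes unnoticed.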

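I'm not exactly sure how you aim to use this, but be aware that the MD5 changes whenever the name, size, creation time, or last-write time of any item directly inside the folder changes. As written, Get-ChildItem lists only the direct children, so an edit to a file inside a sub-folder will generally go unnoticed unless you add -Recurse.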
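If the goal is to check whether anything changed between two runs, one option is to keep the signatures from the previous run and compare them with the new ones. The sketch below assumes both runs are available as hashtables that map a folder path to its signature string; `Compare-FolderSignature` is a hypothetical helper, not a built-in cmdlet.

```powershell
# Hypothetical helper: reports which folders were added, removed, changed, or left unchanged
# between two hashtables of the form @{ 'folder path' = 'signature' }.
function Compare-FolderSignature {
    param(
        [hashtable]$Baseline = @{},
        [hashtable]$Current  = @{}
    )

    $allKeys = @($Baseline.Keys) + @($Current.Keys) | Sort-Object -Unique

    foreach ($key in $allKeys) {
        $status = if (-not $Baseline.ContainsKey($key)) { 'Added' }
                  elseif (-not $Current.ContainsKey($key)) { 'Removed' }
                  elseif ($Baseline[$key] -ne $Current[$key]) { 'Changed' }
                  else { 'Unchanged' }

        [pscustomobject]@{ Folder = $key; Status = $status }
    }
}
```

The baseline could be persisted between runs with `Export-Clixml` and loaded again with `Import-Clixml`; filtering the output with `Where-Object Status -ne 'Unchanged'` then lists only the folders that need attention.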

Darin