
Given a set of files, I want to figure out whether a change happened in any of those files.

I know that for a single file you can use this approach, which computes a checksum you can use to check whether a change happened. I.e. this returns the same value for a given file until something in that file changes, at which point it generates a different hash:

byte[] hashBytes;
using (var inputFileStream = File.OpenRead(filePath))
using (var md5 = MD5.Create())
{
    hashBytes = md5.ComputeHash(inputFileStream);
}

string s = Convert.ToBase64String(hashBytes);

Is there a way to take a collection of hash values and get a single hash from that collection?

List<byte[]> hashCollection = SomeFunctionThatReturnsListByteArray();
//some approach that can create a hash of this

My main goal is to detect if a change happened. I don't care which file changed.
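To make the intended usage concrete, the detection step would just be a comparison of the combined hash against a previously stored one. This is only a sketch; `ComputeCombinedHash` is the hypothetical function I'm asking how to write:

```csharp
// Sketch of the intended usage. ComputeCombinedHash is hypothetical —
// it is the function being asked about.
byte[] baseline = ComputeCombinedHash(filePaths);
// ... later ...
byte[] current = ComputeCombinedHash(filePaths);
bool somethingChanged = !current.SequenceEqual(baseline); // needs System.Linq
```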

elllo

1 Answer


Hashing hashes is not optimal. However, if you don't want to hash all the files together, you can simply write your per-file hashes into a memory stream and hash that.

Disregarding any other problems, conceptual or otherwise:

public static byte[] Hash(IEnumerable<byte[]> source)
{
   using var hash = SHA256.Create();
   using var ms = new MemoryStream();
   foreach (var bytes in source)
      ms.Write(bytes, 0, bytes.Length);
   ms.Seek(0, SeekOrigin.Begin);
   return hash.ComputeHash(ms);
}
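Usage would look something like this. `GetFileHashes` is a hypothetical helper that returns one MD5 hash per file, computed as in the question:

```csharp
// GetFileHashes() is hypothetical: it returns one MD5 hash per file,
// computed the same way as in the question.
List<byte[]> hashCollection = GetFileHashes();
byte[] combined = Hash(hashCollection);

// Store this string; if it differs on a later run, at least one file changed.
string fingerprint = Convert.ToBase64String(combined);
```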

Note: I am not professing this is the best solution; it's just a solution to your immediate problem.

A slightly less allocatey approach:

public static byte[] Hash(IList<byte[]> source)
{
   using var hash = SHA256.Create();
   using var ms = new MemoryStream(source.Sum(x => x.Length));
   foreach (var bytes in source)
      ms.Write(bytes, 0, bytes.Length);
   ms.Seek(0, SeekOrigin.Begin);
   return hash.ComputeHash(ms);
}

For a multi-file hash (untested):

public static byte[] Hash(IEnumerable<string> source)
{
   using var hash = SHA256.Create();

   // adjust to what is fastest for you; for hdd 4k to 10k might be appropriate,
   // for ssd larger will likely help
   // probably best to keep it under 80k so it doesn't end up on the LOH (up to you)
   const int bufferSize = 1024 * 50;

   var buffer = new byte[bufferSize];
   foreach (var file in source)
   {
      using var fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Delete, bufferSize, FileOptions.SequentialScan);
      var bytesRead = 0;
      while ((bytesRead = fs.Read(buffer, 0, bufferSize)) != 0)
         hash.TransformBlock(buffer, 0, bytesRead, buffer, 0);
   }

   // TransformFinalBlock must be called exactly once, after all files have been
   // fed in. Calling it inside the loop finalizes the hash object, and the next
   // TransformBlock throws "Hash not valid for use in specified state".
   hash.TransformFinalBlock(buffer, 0, 0);

   return hash.Hash;
}
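As an alternative to the `TransformBlock`/`TransformFinalBlock` pattern, `IncrementalHash` (available in .NET Core and newer .NET Framework versions) does the same streaming accumulation with less room for state errors. A sketch, not a drop-in replacement for the method above:

```csharp
public static byte[] HashFiles(IEnumerable<string> files)
{
   // IncrementalHash manages the finalize step for you via GetHashAndReset.
   using var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
   var buffer = new byte[1024 * 50];
   foreach (var file in files)
   {
      using var fs = File.OpenRead(file);
      int bytesRead;
      while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
         hash.AppendData(buffer, 0, bytesRead);
   }
   // Finalizes and returns the combined digest over all files, in order.
   return hash.GetHashAndReset();
}
```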
TheGeneral
  • I wouldn't be opposed to hashing all the files together. I just don't know how to approach that outside of doing one of two things. 1. kind of what I'm trying here where I get hash of their hashes. 2. getting the raw text of all the files and merging them into one source which I won't be able to do. If there's another way to approach this I'm open to it. – elllo Feb 12 '21 at 04:03
  • It throws an exception about: System.Security.Cryptography.CryptographicException: 'Hash not valid for use in specified state. On the 2nd loop of the foreach(var file in Source) loop. Edit update -> actually I tried MD5 instead of SHA256. It looks like that does not throw the exception. Now researching to see what the diff is beyond them being different hash methods. – elllo Feb 13 '21 at 01:31
  • How can one calculate the checksum if the file is in use by another process? – Hesoti Jan 02 '22 at 13:57