I am putting together a program that moves large files from one place to another. These files are usually 1 GB or larger and are incredibly important to us. We are a data acquisition company, so data is literally our product.

What I'd like to do is: calculate the MD5 (or some other checksum) of the file -> copy/move the file to its destination -> compare the MD5 (or other checksum) of the original and the copied file.

Since calculating the MD5 requires reading the whole file, I was wondering whether it can be combined with the actual copy, so that the source file is read from beginning to end only once.

Also, the transfers will likely be from one network location to another, so if there is a faster or lighter way than MD5 to validate that both files are identical, please let me know! I'd like to avoid bogging down the network if I can.

P.S. It's important that the whole file is never held in memory, as some of them can get as big as 300 GB.

Matthew Goulart
  • Hmm... couldn't you just calculate the MD5 for each TCP packet sent and received? Not sure about the efficiency, though. You would also have to make sure the packets are always the same size. – krizajb Nov 02 '17 at 22:03
  • @krizajb Everything has to happen on the same machine. I can't get the destination machine to hash an incoming file, so I can't get it to check incoming TCP packets either. – Matthew Goulart Nov 02 '17 at 22:05

1 Answer

My SplitStream library can do the first two steps (hashing and copying) while reading the source stream only once.

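// Requires: System.IO, System.Linq, System.Security.Cryptography, System.Threading.Tasks,
// plus the namespace of the SplitStream package.
string checksumSha1 = null;

// inputSourceStream is the already opened source stream (for example a FileStream).
using (var inputSplitStream = new ReadableSplitStream(inputSourceStream))

// First consumer: copies the data to the destination file.
using (var inputFileStream = inputSplitStream.GetForwardReadOnlyStream())
using (var outputFileStream = File.OpenWrite("MyFileOnAnyFilestore.bin"))

// Second consumer: hashes the same data.
using (var inputSha1Stream = inputSplitStream.GetForwardReadOnlyStream())
using (var outputSha1Stream = SHA1.Create())
{
    // Start reading the source; both forward-only streams are fed from this single read.
    inputSplitStream.StartReadAhead();

    Parallel.Invoke(
        () => {
            var bytes = outputSha1Stream.ComputeHash(inputSha1Stream);
            // "x2" pads every byte to two hex digits; plain "x" would drop leading zeros.
            checksumSha1 = string.Join("", bytes.Select(x => x.ToString("x2")));
        },
        () => {
            inputFileStream.CopyTo(outputFileStream);
        }
    );
}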

github: https://github.com/microknights/SplitStream

I have not tested it on files that large, though, but give it a try.
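If adding a library is not an option, the same single-read idea can be sketched with the framework alone: read a chunk, feed it to an incremental hash, then write that same chunk to the destination. This is only a minimal sketch, not what SplitStream does internally. The class name `HashingCopy`, the method `CopyAndHash`, and the 1 MiB default buffer are made up for illustration, and it assumes a runtime that ships `IncrementalHash`.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, named for illustration only.
public static class HashingCopy
{
    // Copies source to destination chunk by chunk and returns the hash of the
    // bytes that were read, as lowercase hex. Only one buffer of bufferSize
    // bytes is held in memory, regardless of the file size.
    public static string CopyAndHash(
        Stream source,
        Stream destination,
        HashAlgorithmName algorithm,
        int bufferSize = 1024 * 1024) // assumed default, tune for your storage
    {
        using (var hasher = IncrementalHash.CreateHash(algorithm))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                hasher.AppendData(buffer, 0, read);    // hash the chunk
                destination.Write(buffer, 0, read);    // write the same chunk
            }
            destination.Flush();

            var hash = hasher.GetHashAndReset();
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Called with two `FileStream` instances and, for example, `HashAlgorithmName.MD5` or `HashAlgorithmName.SHA256`, this reads the source exactly once. Unlike the `Parallel.Invoke` version, hashing and writing run sequentially on one thread here.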

The final validation of the copied file still requires one more read of the destination; I don't think you can avoid that.
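That extra read could look roughly like this: stream the destination file through the same hash algorithm and compare the result with the checksum obtained during the copy. Again a sketch under assumptions; `CopyVerifier`, `HashFile`, and `MatchesExpected` are hypothetical names, and the buffer size is arbitrary.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, named for illustration only.
public static class CopyVerifier
{
    // Streams the file at path through the hash in fixed-size chunks and
    // returns the digest as lowercase hex. The file is never fully in memory.
    public static string HashFile(string path, HashAlgorithmName algorithm, int bufferSize = 1024 * 1024)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize, FileOptions.SequentialScan))
        using (var hasher = IncrementalHash.CreateHash(algorithm))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                hasher.AppendData(buffer, 0, read);
            }
            return BitConverter.ToString(hasher.GetHashAndReset()).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the file's hash equals the checksum computed during the copy.
    public static bool MatchesExpected(string path, string expectedHex, HashAlgorithmName algorithm)
    {
        return string.Equals(HashFile(path, algorithm), expectedHex, StringComparison.OrdinalIgnoreCase);
    }
}
```

The algorithm passed here has to be the same one used while copying, otherwise the comparison will always fail.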

Frank Nielsen