I am working on a document management system. To detect changed and duplicate files, I compute a SHA-256 digest of each file and compare digests. This is being done in Python. The system can be configured to encrypt files before storing them.
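
For reference, a minimal sketch of how such a digest might be computed with Python's standard `hashlib`, reading the file in chunks so large files do not need to fit in memory. The function name `sha256_file` and the 1 MiB chunk size are illustrative assumptions, not part of the system described here.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper: the name and the 1 MiB chunk size are assumptions.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is taken over the plaintext bytes, so it has to be computed before the optional encryption step.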

The question is whether it is still safe to store the digest for the unencrypted file.

This digest is used as an identifier for the stored files and also to detect whether a file being added to the system already exists. I am okay with the chance of a SHA-256 collision for this purpose. I have also read that a SHA-256 digest cannot be used to recreate the original data.
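
The identifier and duplicate-detection role described above could be sketched roughly as follows. `DigestIndex` and the storage ids are hypothetical stand-ins for the database table, and the in-memory dictionary is only an assumption made to keep the example self-contained.

```python
import hashlib


class DigestIndex:
    """Hypothetical in-memory index: plaintext SHA-256 digest -> storage id."""

    def __init__(self):
        self._by_digest = {}

    def add(self, data: bytes, storage_id: str):
        """Register `data` under `storage_id` unless the same content exists.

        Returns (storage_id_in_use, was_added).
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self._by_digest.get(digest)
        if existing is not None:
            # Same digest already stored: treat as a duplicate.
            return existing, False
        self._by_digest[digest] = storage_id
        return storage_id, True


index = DigestIndex()
print(index.add(b"quarterly report", "guid-1"))  # new content is added
print(index.add(b"quarterly report", "guid-2"))  # duplicate maps back to guid-1
```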

Assuming the file cannot be reconstructed from the hash, and given that the file is stored in encrypted form, it should be safe to keep the hash of the original file for comparisons and searching, right? Or should I rethink my strategy? These comparisons are internal to the application and will not be exposed to the user in any way.

L0stLink
  • 1
    Do you assume that an attacker replaces a file with an old version? This is called a rollback attack. – kelalaka Sep 25 '19 at 09:26
  • 2
    You'll probably have better luck asking this question on [InfoSec.SE](https://security.stackexchange.com/) – Aran-Fey Sep 25 '19 at 09:27
  • 1
    What's your concern exactly? Reversibility? Some sort of hash-collision based attack? – deceze Sep 25 '19 at 09:46
  • 2
    Is the file unique? Because if it's a file available elsewhere, it's likely its hash is also found in a dictionary. – Yann Vernier Sep 25 '19 at 09:48
  • 1
    As long as the files you store are pure binary files of a certain length (not very small ASCII text files that are smaller than e.g. 12 bytes), the file data cannot be reconstructed. For small ASCII files (which may even be identifiable by the file size, which is also stored unencrypted), storing the SHA-256 may cause security problems (brute-force generation of all possible documents). Therefore I would set a minimum size of ~20 bytes for the documents. – Robert Sep 25 '19 at 12:01
  • The system is secured behind a login, and actions that mutate stored data are logged and reversible (think git). It can be assumed that a person capable of logging in to the system is trusted (to the extent allowed by the permissions granted to that user). @Robert as for the files, they will be at least a few hundred bytes, typically in the KiB and MiB range. – L0stLink Sep 25 '19 at 14:33
  • @deceze my concern is whether or not it is safe to use the hash of the unencrypted file to identify the encrypted file considering that the files will be saved to the disk and will have a GUID as name linked to a database table with the hash stored to reference it. Can storing the hash of the unencrypted files with direct reference to the encrypted files compromise system security? – L0stLink Sep 25 '19 at 14:45
  • For normal scenarios it's not a problem, but if these are files that no one else should have (like a KeePass database), storing the hash taken before encryption can give someone confidence that a file they retrieved somehow is the right one. For example, from hard disk blocks left over from mistakenly written temporary files, they can reconstruct the file *and* confirm that they have the correct one. – chexum Feb 17 '20 at 11:10
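
The confirmation scenario raised in the comments above can be illustrated with a short sketch: anyone who can read a stored plaintext digest can test a candidate file against it. The second function shows a keyed digest (HMAC-SHA-256 from the standard `hmac` module), a commonly used way to make such a check depend on a secret; nobody in this thread proposes it, so treat it as an assumption. The function names and the sample key are hypothetical.

```python
import hashlib
import hmac


def confirms_guess(stored_digest_hex: str, candidate: bytes) -> bool:
    """True if `candidate` hashes to the stored plaintext SHA-256 digest."""
    candidate_hex = hashlib.sha256(candidate).hexdigest()
    return hmac.compare_digest(candidate_hex, stored_digest_hex)


def keyed_digest(secret_key: bytes, data: bytes) -> str:
    """HMAC-SHA-256 of `data`; cannot be recomputed without `secret_key`."""
    return hmac.new(secret_key, data, hashlib.sha256).hexdigest()


stored = hashlib.sha256(b"recovered temp file contents").hexdigest()
print(confirms_guess(stored, b"recovered temp file contents"))  # guess confirmed
print(confirms_guess(stored, b"some other file"))               # guess rejected

# With a keyed digest, equal files still map to equal identifiers under one
# key, so duplicate detection keeps working inside the application.
key = b"hypothetical-application-secret"
print(keyed_digest(key, b"same file") == keyed_digest(key, b"same file"))
```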

1 Answer

1

The preimage resistance of SHA-256 is about 2^256 operations, and its collision resistance is about 2^128 (as a brief summary). For comparison, consider the number of guesses needed to find the key that decrypts the file: the complexity of a SHA-256 preimage attack is comparable to brute-forcing a 256-bit symmetric key. So in general I'd say this approach is secure enough, because recovering the original file by guessing the key is no harder than finding a SHA-256 preimage.
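
The comparison above can be written out as rough arithmetic in bits of work. This is only a sketch: the 256-bit key size is an assumption (the question does not name the cipher), and the last figure models the very small ASCII file case from the comments as 12 characters drawn from the 95 printable ASCII characters.

```python
import math

# Generic attack costs for SHA-256, in bits of work (log2 of operations).
SHA256_PREIMAGE_BITS = 256
SHA256_COLLISION_BITS = 128
# Assumed key size; the question does not state the encryption algorithm.
SYMMETRIC_KEY_BITS = 256


def guess_space_bits(alphabet_size: int, length: int) -> float:
    """log2 of the number of possible documents of `length` symbols."""
    return length * math.log2(alphabet_size)


# A 12-character printable-ASCII document, as in the small-file comment.
small_ascii_bits = guess_space_bits(95, 12)

print(f"preimage: 2^{SHA256_PREIMAGE_BITS}, collision: 2^{SHA256_COLLISION_BITS}")
print(f"key search: 2^{SYMMETRIC_KEY_BITS}")
print(f"12-char ASCII document: about 2^{small_ascii_bits:.1f} candidates")
```

Enumerating candidate documents is far cheaper than a generic preimage attack when the file is tiny or guessable, which is why the hash adds no meaningful risk only for files with enough unpredictable content.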

It would be good to know which algorithm and parameters you're going to use for file encryption; the answer might be different in your case.

Oleh Rybalchenko