
I have read multiple times about the integrity mechanism in git based on SHA-1 hashes and links to parent commits, which ensures that no changes are made to the committed data in the git repository.

My question is: during which operations does git check that the hashes are valid, i.e. that they match the content of the commits? Is a check performed during a push, or maybe a pull? Unfortunately, I haven't found any information on this.

Jeff S.
  • Since every SHA-1 sum actually matches a single object, that is, an individual compressed file, my guess is that the sum is probably recomputed on the fly each time this file is written to disk or fully read, by monitoring the data flow. This would include pushes when you send objects, pulls from a remote server when you receive them, and `git checkout` or any command that operates on an object, such as `git show`. This is what I would personally do. – Obsidian Jun 01 '18 at 20:33

1 Answer


Obsidian's comment is spot-on: the name of each Git object is the hash ID of the object's content, so anything that uses the ID to look up and read the content can, and usually does, verify that the hash of the extracted data matches the ID used as a key to extract that data.
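As a rough illustration of "the name is the hash of the content", here is a minimal Python sketch. It assumes the usual loose-object layout (a header of the form `<type> <size>` followed by a NUL byte, then the content) and SHA-1; the function names `git_object_id` and `matches` are made up for this example and are not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Compute the SHA-1 ID of an object the way Git names it.

    Assumes the "<type> <size>\\0<content>" layout used for Git objects.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def matches(expected_id: str, obj_type: str, content: bytes) -> bool:
    """Return True if the content still hashes to the ID used as its key."""
    return git_object_id(obj_type, content) == expected_id


if __name__ == "__main__":
    # The empty blob has a well-known ID in every SHA-1 repository.
    print(git_object_id("blob", b""))
    # A single flipped byte makes the check fail.
    oid = git_object_id("blob", b"some file content\n")
    print(matches(oid, "blob", b"some file content\n"))  # True
    print(matches(oid, "blob", b"some file c0ntent\n"))  # False
```

The same comparison is what lets any reader of an object detect that the stored bytes no longer correspond to the ID it asked for.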

Additional checking, such as verifying the GPG signature in a tag or commit, is only done when you specifically request it. You can request that `git log` check such signatures by default, using the `log.showSignature` configuration setting.

Note that the integrity of any node in a Merkle tree depends on whether you trust prior nodes against second-preimage attacks. If you use GPG-signed tags, the signatures in those tags protect each tag's data (to whatever degree you trust GPG itself), and then the tag protects its commit object (to whatever degree you trust SHA-1). The commit object in turn protects its tree, which protects its subtrees and blobs, and the blob hashes protect their contents. So you should do a different kind of analysis if you're concerned with second-preimage attacks. If you're just concerned with random data corruption (as seen on spinning media and/or non-ECC memory), you can just use the SHA-1 hash directly the way Git does.
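The chain of protection described above (commit protects tree, tree protects blobs) can be sketched in a few lines of Python. This is a simplified model, not Git's code: it handles a single flat tree with one file, sorts entries by plain name, and uses a hypothetical author identity and timestamp. The helper names are invented for the example.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over "<type> <size>\\0<body>", the layout Git objects use."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries; IDs are stored as raw bytes.

    Simplification: sorts by plain name, which is enough for flat trees.
    """
    out = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        out += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)
    return out


def commit_body(tree_id: str, message: str) -> bytes:
    # Hypothetical identity and timestamp, fixed so the result is repeatable.
    ident = "A U Thor <author@example.com> 0 +0000"
    text = f"tree {tree_id}\nauthor {ident}\ncommitter {ident}\n\n{message}\n"
    return text.encode("utf-8")


def commit_id_for(file_content: bytes) -> str:
    """Hash blob -> tree -> commit for a one-file repository."""
    blob = object_id("blob", file_content)
    tree = object_id("tree", tree_body([("100644", "README", blob)]))
    return object_id("commit", commit_body(tree, "initial"))


if __name__ == "__main__":
    # Changing one byte of the file changes the blob ID, hence the tree ID,
    # hence the commit ID.
    print(commit_id_for(b"hello\n") != commit_id_for(b"hellp\n"))  # True
```

A signed tag adds one more link on top: the signature covers the tag's data, which names the commit ID.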

torek
  • OK, thanks for the detailed answer. It is good to know that the hashes are verified by almost any operation on the hashed objects. I'm surprised I didn't find this information in the official documentation, but maybe it is seen as an implementation detail. – Jeff S. Jun 03 '18 at 21:33
  • May I add a stupid question here? Does it also verify that files are not corrupted during `git clone`, for example? If I store some big file in a git repo, should I manually do some checksum checks after `git clone`, or can I be sure that everything is fine if cloning had no errors? – Nikita Popov Dec 12 '20 at 09:29
  • @NikitaPopov: Everything in Git is stored as an "object". Each object has a hash ID, which we compute using some sort of checksum (SHA-1 currently but Git 2.29 already has prototype support for other hash algorithms). Git doesn't store *files*, just these objects. The extraction code *must* be told the hash ID, because that's how we look up an object in the key-value database that stores them. The extraction code then computes the hash again during extraction. This *must* match the key used to look up the object. If it does, the object is not corrupted. – torek Dec 12 '20 at 09:32
  • The actual re-checking of object hash IDs during `git clone` is slightly deferred, but this is an implementation detail: `git clone` consists of six steps, the fifth step being a `git fetch` that obtains all reachable objects. The sender can send those objects one at a time, in which case they'd get checked as they go into storage, but that's terribly slow, so modern protocols send a so-called *pack file* with delta-compressed objects in it. The receiving Git then runs a pack indexer, which reads and checks all the objects. Hence clone finds any errors during the indexing. – torek Dec 12 '20 at 09:36
  • @torek Are you (and everybody else) sure that Git actually checks the hash during the clone? I assumed you would first have to enable [`transfer.fsckObjects`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-transferfsckObjects), which is disabled by default. – JojOatXGME Jun 16 '23 at 16:48
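
The look-up-and-verify behavior described in the comments (the hash ID is the key, and the extraction code recomputes the hash while reading) might be modeled roughly as follows. This is only a toy in-memory key-value store under the assumption of zlib-compressed objects with a `<type> <size>` header; `ObjectStore` and `CorruptObjectError` are hypothetical names, and real Git additionally deals with pack files and delta compression.

```python
import hashlib
import zlib


class CorruptObjectError(Exception):
    """Raised when stored data no longer hashes to the key it was read by."""


class ObjectStore:
    """Toy content-addressed store: key = SHA-1 of header plus content."""

    def __init__(self):
        self._db = {}  # hex object ID -> zlib-compressed bytes

    def write(self, obj_type: str, content: bytes) -> str:
        raw = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
        oid = hashlib.sha1(raw).hexdigest()
        self._db[oid] = zlib.compress(raw)
        return oid

    def read(self, oid: str):
        raw = zlib.decompress(self._db[oid])
        # Recompute the hash during extraction; it must match the lookup key.
        if hashlib.sha1(raw).hexdigest() != oid:
            raise CorruptObjectError(f"object {oid} is corrupt")
        header, _, content = raw.partition(b"\x00")
        obj_type = header.decode("ascii").split(" ")[0]
        return obj_type, content


if __name__ == "__main__":
    store = ObjectStore()
    oid = store.write("blob", b"big file contents")
    print(store.read(oid))
    # Simulate silent corruption: stored bytes change, the key does not.
    store._db[oid] = zlib.compress(b"blob 17\x00big file c0ntents")
    try:
        store.read(oid)
    except CorruptObjectError as err:
        print("detected:", err)
```

In this model, a read that returns without error implies the bytes are the ones that were originally hashed, which is the property the question about `git clone` is asking for.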