0

I'd like to use Git to track media files as well as their associated playlists. Tracking playlists is easy, because they are text files. For the binary files, I've already looked at git-lfs and git-annex, but I would like to explore the following approach:

FLAC files carry an internal MD5 hash. This hash can be read with:

metaflac --show-md5sum filename.flac
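For reference, the same value can be read without spawning metaflac. The following is a minimal sketch, assuming the file starts directly with the `fLaC` marker (no ID3 tag prepended) and that STREAMINFO is the first metadata block, as the FLAC format requires. The function name `flac_stored_md5` is made up for illustration. Note that the stored MD5 covers the unencoded audio data only, not the tags.

```python
def flac_stored_md5(path):
    """Return the MD5 hex digest stored in a FLAC file's STREAMINFO block.

    Hypothetical helper: it only reads the header, it does not decode
    or verify the audio.
    """
    with open(path, "rb") as f:
        # A native FLAC stream starts with the 4-byte marker "fLaC".
        if f.read(4) != b"fLaC":
            raise ValueError("missing fLaC marker")
        # Metadata block header: 1 bit "last block" flag, 7 bits block
        # type, then a 24-bit big-endian body length.
        header = f.read(4)
        if len(header) != 4:
            raise ValueError("truncated metadata block header")
        block_type = header[0] & 0x7F
        length = int.from_bytes(header[1:4], "big")
        # STREAMINFO is block type 0 and its body is 34 bytes long.
        if block_type != 0 or length != 34:
            raise ValueError("first metadata block is not STREAMINFO")
        body = f.read(34)
        if len(body) != 34:
            raise ValueError("truncated STREAMINFO block")
        # The last 16 bytes of STREAMINFO hold the MD5 signature.
        return body[18:34].hex()
```

Calling `flac_stored_md5("filename.flac")` should give the same hex string as the metaflac command above, as long as the assumptions hold.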

With performance in mind, I'd like Git to use the FLAC MD5 hash instead of its internal hash.

How can I do this?

I've read the gitattributes documentation but did not find the answer.

PS: 1st goal is to get lightning fast performance. 2nd goal is that any metadata change to a file would be ignored.

Micha Wiedenmann
kalou.net

2 Answers

1

There is no way to use a custom hash function to identify objects in Git. There is ongoing work to switch to SHA-256, but it is not a general-purpose framework for substituting your own hash function.
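To make concrete what the object hash covers, here is a small sketch of how a blob ID is derived in a SHA-1 repository: Git hashes a short header (`blob`, the content length, a NUL byte) followed by the full file content. The function name `git_blob_id` is only for illustration.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Compute the object ID Git assigns to a blob in a SHA-1 repository."""
    # Git prepends "blob <size>\0" to the content before hashing.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The well-known ID of the empty blob.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This should match `git hash-object` on the same bytes. Because the ID is computed over the whole content, any change to the file, including a tag edit, produces a different object.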

CPU usage in Git is not dominated by hashing; it's dominated by compression. Using a different hash function, even if it were possible, would not produce significant performance benefits. (I've run the numbers myself, as have other Git contributors.)

In addition, MD5 is extremely weak (even weaker than SHA-1) and it shouldn't be used for any purpose whatever nowadays. If you need a fast hash, BLAKE2b is faster than MD5, actually secure, and can be adjusted to an arbitrary length.
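As an illustration of the adjustable length, Python's standard `hashlib` exposes BLAKE2b with a `digest_size` parameter (1 to 64 bytes). This is only a sketch for hashing files outside of Git; the helper name `blake2b_file` and the 1 MiB chunk size are arbitrary choices.

```python
import hashlib


def blake2b_file(path, digest_size=32, chunk_size=1 << 20):
    """Hash a file with BLAKE2b, reading it in chunks to bound memory use."""
    h = hashlib.blake2b(digest_size=digest_size)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The hex string is twice `digest_size` characters long, so `digest_size=20` gives the same length as a SHA-1 hash.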

bk2204
  • OK, thank you very much for the information. However, if I may suggest: what if handling (rather large) binary media files is not supposed to have the same "computing profile" as handling (rather short) text files? Compressing MP3 makes no sense, so compression would be off, and checksumming would be replaced with "look for the hash in the file header". With both CPU-costly operations removed, blazing-fast performance could be achieved, couldn't it? – kalou.net Jan 09 '20 at 20:17
  • If you have large binary files you want to store, Git LFS is a good choice. However, I'm not aware of any tools that parse files to look for internal checksums. That requires a lot of code specific to your use case and file format. I don't believe either Git or Git LFS will add that functionality, so you should adopt a different approach. – bk2204 Jan 09 '20 at 23:10
1

git-annex is the tool for that. git-annex somewhat recently released the external backend protocol.

So you could read that MD5 hash from your .flac files and return it whenever GENKEY is called on a .flac file. You'll also have to figure out how that MD5 digest was calculated, so that you can reproduce it on VERIFYKEYCONTENT calls.
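A rough sketch of what such a backend could look like follows. Treat it as a starting point under several assumptions, not as a working backend. Only GENKEY and VERIFYKEYCONTENT are handled. The reply strings and the `XNAME--hash` key format are written from memory of the protocol page, so check them against the git-annex documentation, which also lists other requests a real backend has to answer. The backend name `XFLACMD5` and all function names are invented here. The verification step only re-reads the header, whereas a real check would have to decode the audio (for example with `flac --test`).

```python
import sys

BACKEND = "XFLACMD5"  # hypothetical backend name


def flac_stored_md5(path):
    """Read the MD5 stored in STREAMINFO (assumes no ID3 tag is prepended)."""
    with open(path, "rb") as f:
        if f.read(4) != b"fLaC":
            raise ValueError("missing fLaC marker")
        header = f.read(4)
        # Block type 0 (STREAMINFO) with a 34-byte body must come first.
        if len(header) != 4 or header[0] & 0x7F != 0 or header[1:4] != b"\x00\x00\x22":
            raise ValueError("first metadata block is not STREAMINFO")
        body = f.read(34)
        if len(body) != 34:
            raise ValueError("truncated STREAMINFO block")
        return body[18:34].hex()


def handle(line):
    """Map one request line to one reply line (reply names are assumptions)."""
    cmd, _, rest = line.rstrip("\n").partition(" ")
    if cmd == "GENKEY":
        # rest is the content file; it may contain spaces.
        try:
            return "GENKEY-SUCCESS %s--%s" % (BACKEND, flac_stored_md5(rest))
        except (OSError, ValueError) as exc:
            return "GENKEY-FAILURE %s" % exc
    if cmd == "VERIFYKEYCONTENT":
        key, _, path = rest.partition(" ")
        try:
            # Placeholder check: compares against the header only.
            ok = key == "%s--%s" % (BACKEND, flac_stored_md5(path))
        except (OSError, ValueError):
            ok = False
        return "VERIFYKEYCONTENT-SUCCESS" if ok else "VERIFYKEYCONTENT-FAILURE"
    return "ERROR unsupported request: " + cmd


def main():
    # One request per line on stdin, one reply per line on stdout.
    for line in sys.stdin:
        print(handle(line), flush=True)
```

To try it, `main()` would be called from an executable that git-annex can find on `PATH`. If I recall the naming convention correctly, that would be `git-annex-backend-XFLACMD5`.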

Git itself was not designed to handle binary files, nor was it designed with the extensibility to allow for that in the future.

Here is some additional discussion on the topic. It's also not the first time someone has had the idea of using precomputed MD5 sums for content-addressed blob storage.

Nei Neto