Let's consider a site where users can upload files. To avoid filename collisions, can I use MD5 or SHA1 hashes of the files' contents as filenames? If not, what should I use?

x-yuri
  • A GUID? Otherwise two users uploading the same file is a collision. – Alex K. Aug 18 '17 at 17:28
  • @AlexK: It depends. On write, Django by default chooses a unique filename (appending to the given one if needed). – dhke Aug 18 '17 at 17:32
  • Do you want the deduplication? Otherwise, use a random shuffle with a sufficient number of bits. – dhke Aug 18 '17 at 17:33
  • Actually, I'm now using `NamedTemporaryFile` (for importing data from the old database). But using a hash would simplify some matters. Probably not much. But still, can I resort to a hash? @dhke I want to avoid collisions, where a user uploads a file which overwrites a file by another user. – x-yuri Aug 18 '17 at 18:30

2 Answers

You can use almost anything as a filename, minus reserved characters. A hash-based name tells you nothing about the file itself, aside from its hash value. Provided users aren't uploading identical files, that should prevent file naming collisions. If you don't care about descriptive names, have at it.

Usually people upload files in order for someone to pull them back down. So you'd need to have a descriptor of some kind; otherwise users would need to open a mass of files to get the one they want. Perhaps a better option would be to let the user select a name (up to a character limit) and then append the datetime code. Then, in order to have a collision, you'd need to have 2 users select the exact same name at the exact same time. Include seconds in the datetime code, and the chances of collision approach (but never equal) zero.
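The name-plus-datetime idea could be sketched in Python roughly as follows. This is only an illustration: the function name `timestamped_name`, the 50-character limit, and the sanitizing rules are assumptions, not something the answer specifies.

```python
import re
from datetime import datetime, timezone
from pathlib import PurePath

MAX_STEM_LENGTH = 50  # assumed character limit for the user-chosen part


def timestamped_name(user_name, now=None):
    """Build '<sanitized stem>_<UTC timestamp><extension>' from a user-chosen name."""
    now = now or datetime.now(timezone.utc)
    path = PurePath(user_name)  # drops any directory components the user typed
    # Keep only letters, digits, dashes and underscores in the stem.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", path.stem).strip("_")[:MAX_STEM_LENGTH]
    stem = stem or "file"  # fall back if nothing usable is left
    suffix = re.sub(r"[^A-Za-z0-9.]+", "", path.suffix)
    # Second-level resolution, as suggested above.
    return f"{stem}_{now.strftime('%Y%m%dT%H%M%S')}{suffix}"
```

Two users choosing the same name within the same second would still produce the same filename, so the code that writes the file should still refuse to overwrite an existing one.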

baldprussian
  • From what I know, hashes are prone to collisions. That is, files might have different contents, but still have the same hash. And even if files are identical, they belong to different people. Each must have its own copy. That's just simpler than assuming that a file might belong to several people, which means, for example, extra checks when deleting a file. As for what you suggest, that just makes users' lives harder. There might be some exceptions, but more often than not it doesn't make much sense to ask the user for a sane filename. Correct me if I'm wrong. – x-yuri Aug 18 '17 at 18:37
  • On second thought, you might be right. It probably generally makes sense to use meaningful filenames. Which is probably also good for SEO. – x-yuri Apr 03 '18 at 22:05

Despite the published SHA1 collision attack, the probability of an accidental SHA1 collision is still so low that SHA1 hashes can be assumed safe to use as filenames in most cases.
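A minimal sketch of hash-based naming in Python, assuming the whole upload fits in memory as `bytes`. The function name `store_by_hash` and the `upload_dir` parameter are hypothetical; the byte comparison is an extra safeguard for the unlikely case that two different files share a digest.

```python
import hashlib
from pathlib import Path


def store_by_hash(data: bytes, upload_dir: Path) -> Path:
    """Write data under its SHA-1 hex digest; fail loudly on a true collision."""
    target = upload_dir / hashlib.sha1(data).hexdigest()
    try:
        # Mode "xb" creates the file and raises FileExistsError if it already exists.
        with open(target, "xb") as fh:
            fh.write(data)
    except FileExistsError:
        # Same name already stored: either identical content or a hash collision.
        if target.read_bytes() != data:
            raise RuntimeError(f"hash collision on {target.name}")
    return target
```

Swapping `hashlib.sha1` for `hashlib.sha256` is a one-line change if deliberately crafted collisions are a concern.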

The other common approach is to use a GUID/UUID for every file. So the only question left is how you want to handle two identical files uploaded by two users. The easiest way is to treat them as two separate files, so that neither affects the other.
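For the GUID/UUID approach, a short Python sketch might look like this. The helper name `uuid_name` is hypothetical, and keeping the original extension is an assumption about what the site needs.

```python
import uuid
from pathlib import PurePath


def uuid_name(original_name: str) -> str:
    """Return a random UUID4 hex string plus the original file extension."""
    suffix = PurePath(original_name).suffix.lower()
    return f"{uuid.uuid4().hex}{suffix}"
```

Because every call produces a fresh random name, two users uploading identical files end up with two separate stored files.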

Sometimes, though, you might be concerned about storage space. For example, if the uploaded files are really big, you might want to store two identical files as one to save space. Depending on the user experience of your system, you might then need to handle some follow-up situations, such as when one of the two users removes the file. However, these are not difficult to handle and just depend on the rest of your system.
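One way to handle the "one user removes the file" case is reference counting. The following is an in-memory Python sketch only; the class `DedupStore` and its dictionaries are hypothetical stand-ins for whatever database and file storage the real system uses.

```python
import hashlib


class DedupStore:
    """In-memory sketch: one blob per distinct content, with a reference count."""

    def __init__(self):
        self.blobs = {}     # digest -> bytes
        self.refcount = {}  # digest -> number of user files pointing at it
        self.owners = {}    # (user, filename) -> digest

    def upload(self, user, filename, data):
        key = (user, filename)
        if key in self.owners:
            self.delete(user, filename)  # replacing an existing file
        digest = hashlib.sha1(data).hexdigest()
        self.blobs.setdefault(digest, data)  # store content only once
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        self.owners[key] = digest
        return digest

    def delete(self, user, filename):
        digest = self.owners.pop((user, filename))
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            # Last reference gone: the shared blob can be removed.
            del self.refcount[digest]
            del self.blobs[digest]
```

With this layout, deleting one user's file leaves the shared content in place until the last owner removes it.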

Phillip Chan
  • I believe I ran into an issue when using MD5, and if my memory serves me right it was not about deletion, but about files not being identical. Although I might be wrong. Anyway, when using MD5, you'd better write code that expects collisions to happen. – x-yuri Sep 12 '18 at 05:07
  • @x-yuri That's interesting to know. I wonder how heavily file uploading is used in your system, though. – Phillip Chan Sep 12 '18 at 15:36
  • I'm not even sure what project that was. And not sure what I said above is true. I suppose that there was some issue that made me think about it, but again not sure. The project in the question has not been launched, I was talking about the other one. – x-yuri Sep 12 '18 at 17:09