
Suppose I give you a file that you can read but can't change or copy. I then rename the file and move it to a new location. How could you identify that file again (fairly reliably)?

The context: I have a database of media files for a program. If the user changes the name or location of a file, could I find it again by searching a directory for something that identifies it?

Greg
  • One idea is to remember a hash of the file and file length. – e0k Nov 29 '16 at 00:04
  • `You can read the file but you can't change it or copy it` <--- If I can read it, I can copy it. – wim Nov 29 '16 at 00:05
  • And you can change it while copying it. – Remy Lebeau Nov 29 '16 at 00:06
  • I've added some context now but copying the file would be impractical, because then I would just have two copies of the file. – Greg Nov 29 '16 at 00:10
  • On an NTFS or ReFS filesystem, at least, as long as the file remains on the same volume, it has a unique ID assigned to it that is persisted even if the file is moved, renamed, modified, etc. That ID can be used with [`OpenFileById()`](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365432.aspx), for instance. When checking if two file handles refer to the same file, you can check if they have the same ID, rather than comparing their file paths. You can retrieve a file's unique ID using `GetFileInformationByHandle/Ex()`. – Remy Lebeau Nov 29 '16 at 00:11
  • Keep the MD5 Sum or SHA1 of the file in the database for that file. – Chimera Nov 29 '16 at 00:18

1 Answer


I have done exactly this; it's not hard.

I take a 256-bit hash of the file (I forget offhand which routine I used) and write it, along with the file size, to a table. If both match, the files match. (I think tracking the size is more paranoia than necessity.) To speed things up, I also fold that hash to a 32-bit value. Only if the 32-bit values match do I check all the data.
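As a rough illustration of the idea, here is a minimal Python sketch. The answer does not say which 256-bit hash was used, so SHA-256 is an assumption here, as is XOR-folding as the way to get a 32-bit value. The function names (`fingerprint`, `fold32`, `find_moved_file`) are made up for this example.

```python
import hashlib
from pathlib import Path


def fold32(digest):
    """XOR the digest down to 32 bits, four bytes at a time (one possible fold)."""
    folded = 0
    for i in range(0, len(digest), 4):
        folded ^= int.from_bytes(digest[i:i + 4], "big")
    return folded


def fingerprint(path, chunk_size=1 << 20):
    """Return (digest, size_in_bytes, folded_32bit) for a file.

    SHA-256 is assumed; the file is read in chunks so large media
    files are never held in memory all at once.
    """
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
            size += len(chunk)
    digest = h.digest()
    return digest, size, fold32(digest)


def find_moved_file(root, known_digest, known_size):
    """Search root recursively for a file matching the stored digest and size."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Cheap check first: skip hashing files whose size differs.
        if path.stat().st_size != known_size:
            continue
        digest, size, _ = fingerprint(path)
        if digest == known_digest and size == known_size:
            return path
    return None
```

Comparing sizes before hashing means most files in the directory are never read at all, which matters when the files are large.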

For the sake of performance, I persist the last 10 million files I have examined. The 32-bit values go in one file, which is read in its entirety. When a main record needs to be examined, I pull in a "page" of them (I forget exactly how big), padded to align with the disk.
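One way this two-tier layout might look is sketched below in Python. The on-disk format is an assumption: a file of little-endian 32-bit folded values, plus a parallel file of fixed-size records holding a 32-byte digest and an 8-byte size. For simplicity the sketch reads one record at a time rather than disk-aligned pages. `RECORD`, `save_index`, `load_folded`, and `lookup` are hypothetical names.

```python
import struct

# Hypothetical record layout: 32-byte digest followed by a 64-bit file size.
RECORD = struct.Struct("<32sQ")


def save_index(entries, folded_path, records_path):
    """Write two parallel files from an iterable of (digest, size, folded)."""
    folded_values = []
    with open(records_path, "wb") as rec:
        for digest, size, f32 in entries:
            folded_values.append(f32)
            rec.write(RECORD.pack(digest, size))
    with open(folded_path, "wb") as out:
        out.write(struct.pack(f"<{len(folded_values)}I", *folded_values))


def load_folded(folded_path):
    """Read every 32-bit value into memory in a single pass."""
    with open(folded_path, "rb") as f:
        data = f.read()
    return [value for (value,) in struct.iter_unpack("<I", data)]


def lookup(folded_values, records_path, digest, size, f32):
    """Return the index of the matching record, or None if there is none.

    The full record is only read from disk when the 32-bit value matches.
    """
    with open(records_path, "rb") as rec:
        for i, candidate in enumerate(folded_values):
            if candidate != f32:
                continue
            rec.seek(i * RECORD.size)
            stored_digest, stored_size = RECORD.unpack(rec.read(RECORD.size))
            if stored_digest == digest and stored_size == size:
                return i
    return None
```

At 4 bytes per entry, 10 million folded values take about 40 MB, so holding them all in memory is cheap compared with the full records.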

Loren Pechtel