
Part of a piece of software I am developing is a file tracker. It tracks movie or series files (500 MB - 50 GB). It has to keep tracking files even if they are

  • moved on the same disk
  • moved to another disk
  • moved to network shares
  • renamed
  • and so on

If one of those happens, it has to scan all attached devices and re-index the "lost" file. But that turned out to be much harder than I thought. I have googled and tried so many things, but nothing seems to be really good. Every approach I tried fails at least one criterion:

  • File system ID -> only works on a single disk
  • File name -> doesn't survive a rename
  • File size -> pretty unstable
  • Hashing -> EXTREMELY expensive; doesn't work on low-power machines
  • Windows-API-Code-Pack -> I was never able to save the ID. Sometimes it throws an exception, sometimes it seems to work but doesn't. It also seems to be out of development
  • Watermark files -> it seems possible to append a GUID to the end of a file, but that changes the file, and it seems to be slow with really large files

Combining several of these seems like it could solve the problem, but that gets quite complex in code and computing time. My best experience so far has been with watermarking. Maybe there is a way to append and read the GUID with better performance? To me it seems really slow on large files.
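To make the watermarking idea concrete, here is roughly the kind of code I mean (just a sketch, assuming the GUID simply occupies the last 16 bytes of a tagged file; appending still modifies the file, which may break some formats). The point is that both operations seek straight to the end, so they should not depend on the file size:

```csharp
using System;
using System.IO;

static class Watermark
{
    // Append a new GUID to the very end of the file; O(1) regardless of file size.
    public static Guid Append(string path)
    {
        var id = Guid.NewGuid();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            fs.Seek(0, SeekOrigin.End);          // jump to the end, nothing is copied
            fs.Write(id.ToByteArray(), 0, 16);
        }
        return id;
    }

    // Read the last 16 bytes back as the GUID; also O(1).
    public static Guid Read(string path)
    {
        var buffer = new byte[16];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(-16, SeekOrigin.End);
            fs.Read(buffer, 0, 16);
        }
        return new Guid(buffer);
    }
}
```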

I really need a way to identify files quickly and consistently. The identifier must not get lost and must work on NTFS and ext#. I hope to get some good tips for my complex question. Thanks :)

Stefm
  • Since it is quite a common interview question you should be able to find plenty of discussion on it already... Bing search if you are ok using it - https://www.bing.com/search?q=google+interview+question+file+duplicate+hashing – Alexei Levenkov Dec 04 '16 at 18:57
  • If the only thing you are interested in is faster appending to a file - then this is the same as http://stackoverflow.com/questions/2398418/how-append-data-to-a-binary-file (note that appending to a file *does not* solve the problem you describe, as you change the file and can break the file format) – Alexei Levenkov Dec 04 '16 at 18:59

1 Answer


I am not sure what you mean by "unstable" when it comes to file size.

My suggestion would be to use the file size in bytes as a first means of indexing (it is cached by the OS and allows for an extremely quick duplicate check).

Afterwards you can use a FileStream to read not the full file but only the first 1 MB (or whatever you choose) and hash that. This should be fairly quick.

This should give you pretty accurate tracking of the file, even though it's not perfect. If you want perfect, hashing the complete file is a must.
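A rough sketch of that size-plus-partial-hash idea (the class name, the 1 MB chunk size and SHA-256 are just example choices):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileFingerprint
{
    const int ChunkSize = 1024 * 1024;   // hash only the first 1 MB

    // Combine the exact byte size with a hash of the first chunk, e.g. "7340032:9F86D0..."
    public static string Compute(string path)
    {
        long size = new FileInfo(path).Length;        // cheap: comes from file metadata

        var buffer = new byte[ChunkSize];
        int read;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            read = fs.Read(buffer, 0, ChunkSize);     // only the first chunk is touched
        }

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(buffer, 0, read);
            return size + ":" + BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```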

On NTFS you could use "Alternate Data Streams" to attach IDs to a file, but those can also be freely added/removed by the user and will get lost when the file leaves NTFS.
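A minimal sketch of tagging a file that way (the stream name "tracker.id" is an arbitrary choice; the colon syntax goes through the normal File APIs on .NET Core / .NET 5+, while on the classic .NET Framework it throws and you would have to P/Invoke CreateFile instead):

```csharp
using System;
using System.IO;

static class AdsTag
{
    // Write the ID into an alternate stream "tracker.id" attached to the file;
    // the main data stream is left untouched.
    public static void WriteId(string path, Guid id) =>
        File.WriteAllText(path + ":tracker.id", id.ToString());

    // Read it back; throws FileNotFoundException if the stream does not exist.
    public static Guid ReadId(string path) =>
        Guid.Parse(File.ReadAllText(path + ":tracker.id"));
}
```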

Martin
  • I think it's unstable because there could be another file with the same size. And I'm not sure what happens to this value if the file is moved to another disk, maybe with other block sizes. Full hashing of a few TB of data is too expensive even on large servers, because you have to re-hash most of the files if one gets lost. I have never heard of "Alternate Streams" before. That sounds interesting – Stefm Dec 04 '16 at 18:50
  • The file size is (at least on Windows) not affected by cluster size or compression unless you specifically ask for those values. So the file size should be the actual number of content bytes and thus be pretty accurate. If you add the partial hashing, you should be fine IMHO – Martin Dec 05 '16 at 13:17