
I am looking for a solution to a fairly basic question: is there a recommended practice to make sure that a file being stored in GridFS does not end up duplicated? We have noticed that, on very rare occasions, our store call, which is as simple as it gets (using the Java driver), can create a duplicate of the new file when executed in parallel.

    import com.mongodb.gridfs.GridFS;
    import com.mongodb.gridfs.GridFSInputFile;

    GridFS gridfs = new GridFS(db);
    // createFile(...) is the public factory method; the GridFSInputFile constructor itself is protected
    GridFSInputFile file = gridfs.createFile(fileContent, fileName, true);
    file.put("type", "email");          // extra field on the fs.files document
    file.setContentType(contentType);
    file.save();

We are using FSYNC_SAFE as the write concern in this case, and the collection is sharded. Should we avoid the Mongo driver completely and write directly into the GridFS files collection so that we can add extra logic, or is it easier, after the save is done, to check for and remove the duplicate (which is of course not optimal)?
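
For what it's worth, here is a minimal sketch of the check-and-remove fallback mentioned above, written against the same legacy Java driver API as the snippet in the question; the assumption that a duplicate is identified by matching `filename` and `md5` is mine, not something stated in the question:

    import java.util.List;

    import com.mongodb.BasicDBObject;
    import com.mongodb.gridfs.GridFSDBFile;

    // After file.save(), look for other fs.files documents that carry the
    // same filename and md5, and delete every copy except the one just written.
    List<GridFSDBFile> copies = gridfs.find(
            new BasicDBObject("filename", fileName).append("md5", file.getMD5()));
    for (GridFSDBFile copy : copies) {
        if (!copy.getId().equals(file.getId())) {
            // remove(DBObject) deletes the matching files documents and their chunks
            gridfs.remove(new BasicDBObject("_id", copy.getId()));
        }
    }

Note that this cleanup is itself racy: two writers running it concurrently can each delete the other's copy and leave no file at all, which is essentially why the comments below converge on a unique index instead.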

Milan Aleksić
  • I would save an md5 hash in a uniquely indexed field in the `files` collection to prevent duplicate insertions. I believe some drivers actually do this already since they verify the correctness of the file upload using the [filemd5](http://docs.mongodb.org/manual/reference/command/filemd5/) command, which computes the md5 hash on the server-side. – wdberkeley Dec 31 '14 at 16:53
  • There is already an md5 calculated by the Java driver. But if I want to make sure that only one file ends up existing, as an atomic operation let's say, how would this md5 help? Don't get me wrong, I still think that findAndModify on that collection is probably the solution to the problem, but that is not directly exposed by the driver (for GridFS), so it might take some time to implement. – Milan Aleksić Dec 31 '14 at 18:16
  • The md5 is (in all probability) a unique value for each file, regarded as a sequence of bytes. If you uniquely index the md5 and insert it in files as a field on the file document, an index collision means you're inserting the same file twice. – wdberkeley Jan 02 '15 at 02:14
  • Yes, of course, but at some point it doesn't scale, as in our case... too many millions of documents to put an extra index on, unfortunately. – Milan Aleksić Jan 02 '15 at 09:08
  • What were the scaling problems when you tried it? What limited the scaling? Size of the index? How large was it for your millions of documents? – wdberkeley Jan 02 '15 at 14:20
  • The size of the index was the bottleneck, but don't ask me for an exact number, we don't really have it - most probably measured in GBs. Nevertheless, that is probably the only way to go (a sketch of that approach follows these comments), so do write an answer saying to just use an index, since there is no other way to make it happen, and I will accept it. If some other idea pops up in the future... voting will show, in case this question gets some traction. – Milan Aleksić Jan 02 '15 at 14:35
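
For completeness, a minimal sketch of the unique-index approach discussed in the comments, again against the legacy Java driver. The exception type, the cleanup of orphaned chunks, and the collection/field names are my assumptions rather than something confirmed in the thread, and on a sharded collection a unique index is only possible when the indexed fields are prefixed by the shard key:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DuplicateKeyException;
    import com.mongodb.gridfs.GridFS;
    import com.mongodb.gridfs.GridFSInputFile;

    // One-time setup: enforce uniqueness of the md5 field on fs.files.
    db.getCollection("fs.files").createIndex(
            new BasicDBObject("md5", 1),
            new BasicDBObject("unique", true));

    GridFS gridfs = new GridFS(db);
    GridFSInputFile file = gridfs.createFile(fileContent, fileName, true);
    file.put("type", "email");
    file.setContentType(contentType);
    try {
        file.save();
    } catch (DuplicateKeyException e) {
        // A file with the same md5 already exists. The chunks written by this
        // attempt are orphaned at this point and need to be cleaned up.
        db.getCollection("fs.chunks").remove(new BasicDBObject("files_id", file.getId()));
    }

Since GridFS writes the chunks before the files document, the duplicate-key error only surfaces at the very end of the save, which is why the catch block removes the chunks left behind by the losing writer.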

0 Answers