Multiple applications using copies of a directory on a SAN

Question

I have an application (Endeca) that is a file-based search engine. A customer has 100 Linux servers, all attached to the same SAN (very fast, Fibre Channel). All 100 servers use the same set of index files, but each server currently has its own copy of the index (approx. 4 GB each, so about 400 GB in total).

What I would like is one directory plus 100 virtual copies of that directory. Only if the application needs to change any of the files in that directory would the system start creating a distinct copy of the original folder.

So my idea is this: All 100 start using the same directory (but they each think they have their own copy, and don't know any better). As changes come in, Linux/SAN would then potentially have up to 100 copies (now slightly different) of that original.

Is something like this possible?

I'm investigating this approach to reduce file transfer times and disk space. We would only have to copy the 4 GB of index files to the SAN once and then create virtual copies. If no changes came in, we'd use only 4 GB instead of 400 GB.

Thanks in advance!

score 0 · Answer 1 · answered Feb 08 '13 at 01:50

The best solution here is to use the "de-dupe" (deduplication) functionality at the SAN level. Vendors may use different names for it, but this is what I am talking about:

https://communities.netapp.com/community/netapp-blogs/drdedupe/blog/2010/04/07/how-netapp-deduplication-works--a-primer

All 100 "virtual" copies will use the same physical disk blocks on the SAN. The SAN only needs to allocate new blocks when a specific copy of a file changes. A new block is then allocated for that copy, while the remaining 99 copies keep using the old block, which dramatically reduces the disk space required.
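To make the block-sharing idea concrete, here is a toy model in Python. It is only a sketch of the general technique (content-addressed blocks with reference counts), not how NetApp or any other vendor actually implements deduplication. The names `BlockStore`, `VirtualCopy` and `build_copies` are made up for illustration.

```python
import hashlib


class BlockStore:
    """Toy deduplicating store: identical blocks are kept only once."""

    def __init__(self):
        self.blocks = {}    # content hash -> block bytes
        self.refcount = {}  # content hash -> number of references

    def put(self, data: bytes) -> str:
        """Store a block (or reuse an identical one) and return its key."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = data
            self.refcount[key] = 0
        self.refcount[key] += 1
        return key

    def release(self, key: str) -> None:
        """Drop one reference; free the block when nobody uses it."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.blocks[key]
            del self.refcount[key]

    def physical_blocks(self) -> int:
        return len(self.blocks)


class VirtualCopy:
    """One server's view of the index: a list of block keys."""

    def __init__(self, store: BlockStore, blocks):
        self.store = store
        self.keys = [store.put(b) for b in blocks]

    def read_block(self, index: int) -> bytes:
        return self.store.blocks[self.keys[index]]

    def write_block(self, index: int, data: bytes) -> None:
        # Only this copy is repointed; the other copies keep the old block.
        old_key = self.keys[index]
        self.keys[index] = self.store.put(data)
        self.store.release(old_key)


def build_copies(num_copies: int, num_blocks: int):
    """Create num_copies identical virtual copies of a small fake index."""
    store = BlockStore()
    index_blocks = [f"block-{i}".encode() for i in range(num_blocks)]
    copies = [VirtualCopy(store, index_blocks) for _ in range(num_copies)]
    return store, copies
```

With `build_copies(100, 4)` the store holds 4 physical blocks for all 100 copies. After one copy calls `write_block` on a single block, the store holds 5, and the other 99 copies still read the original data.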

score 0 · Answer 2 · answered Feb 21 '13 at 19:34

What version of Endeca are you using? The MDEX 7 engine has a clustering capability in which the leader and follower nodes all read from the same set of files, so as long as the files are shared (say, over NAS) you can have multiple engines running on different machines backed by the same set of index files. Only the leader node can change the files, which keeps the changes consistent; the follower nodes are then notified by the cluster coordinator when the changes are ready to be "picked up".

In the MDEX 6 series you could probably achieve something similar, provided that the index files are read-only. Indexing in V6 usually happens on another machine, and the destination set of index files is usually replaced once the new index is ready. This won't help, though, if you need partial updates.

NetApp deduplication sounds interesting, but Endeca has never tested the functionality, so I am not sure what kinds of problems you will run into.
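One common way to "replace the index once the new one is ready" on Linux is to have readers open the index through a symlink and repoint that symlink atomically. The sketch below shows that general POSIX technique in Python. It is not an Endeca feature, the function name `publish_index` and all paths are hypothetical, and it says nothing about whether a running engine notices the switch without a restart. It also assumes a local POSIX file system; behaviour over NAS may differ because of client-side caching.

```python
import os
import tempfile


def publish_index(new_index_dir: str, live_link: str) -> None:
    """Atomically repoint the symlink live_link at new_index_dir."""
    # Create the new symlink under a temporary name next to the live one.
    tmp_link = f"{live_link}.tmp.{os.getpid()}"
    os.symlink(new_index_dir, tmp_link)
    # rename() replaces the old symlink in one step, so a reader sees
    # either the old index directory or the new one, never a mix.
    os.replace(tmp_link, live_link)


def make_fake_index(base_dir: str, name: str, content: str) -> str:
    """Helper for the demo: create a directory holding one small file."""
    path = os.path.join(base_dir, name)
    os.mkdir(path)
    with open(os.path.join(path, "data.txt"), "w") as handle:
        handle.write(content)
    return path


def read_live(live_link: str) -> str:
    """Read the demo file through the live symlink."""
    with open(os.path.join(live_link, "data.txt")) as handle:
        return handle.read()
```

A typical flow would be to build the new index in its own directory (for example `index_v2`), call `publish_index` on it, and delete the old directory only after no reader is using it any more.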
