
I run a file sharing site, which is quickly growing in popularity.

Right now my web app is on AWS Elastic Beanstalk, so it scales up beautifully. However, my files are currently all served from a single dedicated box. The box is starting to max out its 1 Gbps connection, so I'm trying to research how to scale the file storage up too.

NB: I also have all the files synced to S3, but it's far too costly to serve them from there due to S3 bandwidth charges. My dedicated box is unmetered.

So far I've seen talk of DRBD and Lsyncd, but neither feels like what I'm looking for.

Any advice on the best setup for running multiple Linux file storage boxes in real-time sync behind a load balancer would be GREATLY appreciated.

P.S. It's worth noting that my ideal scenario is for all boxes to be in sync at all times, so if a file is added to one box, it is synced across all boxes. The same goes for when a file is deleted.

Ryan

1 Answer


GlusterFS is a great fit for this, as is Ceph. GlusterFS is easier to manage, and does not use node-to-node replication as its primary method of data replication or distribution. It can perform 2n or 3n brick mirroring, where a brick is merely a filesystem on a node. A complete array of bricks is referred to as a volume, and a volume is mounted like an NFS share, with the exception that reads and writes go to multiple nodes rather than only one.
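To make the brick and volume terminology concrete, here is a minimal sketch that only builds (and prints) the usual command sequence for a replicated volume with one brick per node. The host names, brick path, volume name, and mount point are all made up for illustration, and the helper `replica_volume_commands` is hypothetical. Check the exact commands against the Gluster documentation for your version before running any of them.

```python
from typing import List


def replica_volume_commands(
    volume: str, hosts: List[str], brick_path: str, mount_point: str
) -> List[str]:
    """Build the shell commands for a replicated volume, one brick per host.

    Nothing is executed here; the commands are returned as strings.
    """
    if len(hosts) < 2:
        raise ValueError("a replicated volume needs at least two bricks")

    first, others = hosts[0], hosts[1:]

    # Run on the first node: add the other nodes to the trusted pool.
    commands = [f"gluster peer probe {host}" for host in others]

    # A brick is just a directory on a node's filesystem: host:/path.
    bricks = " ".join(f"{host}:{brick_path}" for host in hosts)

    # "replica N" keeps a full copy of every file on each of the N bricks.
    commands.append(f"gluster volume create {volume} replica {len(hosts)} {bricks}")
    commands.append(f"gluster volume start {volume}")

    # Run on each client (for example, each web server) to mount the volume.
    commands.append(f"mount -t glusterfs {first}:/{volume} {mount_point}")
    return commands


if __name__ == "__main__":
    # Hypothetical names, chosen only for this example.
    for command in replica_volume_commands(
        "files", ["storage1", "storage2", "storage3"], "/data/brick1/files", "/mnt/files"
    ):
        print(command)
```

Once mounted, the web app reads and writes under the mount point like any other directory. Depending on your setup, Gluster may ask for confirmation or extra options when creating the volume (for instance with only two replicas), so treat this as a starting point rather than a finished recipe.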

Gluster scales up and out beautifully, and has no concept of a master node. All nodes participate equally in the volumes they are a member of. It's the clients that connect to GlusterFS that are responsible for fanning out data to all nodes, rather than each node being responsible for replicating data. This way, you don't need huge, badly scaling backhaul links between the nodes.
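One practical consequence of client-side fan-out is that the writing client sends a full copy to every replica itself. The following is a rough back-of-the-envelope sketch, not a model of Gluster internals: it assumes the client's uplink is the only bottleneck and ignores protocol overhead. The function names are made up for this example.

```python
def client_bytes_sent(file_bytes: int, replica_count: int) -> int:
    """Bytes the client puts on the wire to write one file to every replica."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return file_bytes * replica_count


def effective_write_gbps(client_link_gbps: float, replica_count: int) -> float:
    """Approximate upper bound on write throughput for a single client.

    Assumes the client's uplink is shared evenly between the replicas.
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return client_link_gbps / replica_count


if __name__ == "__main__":
    # Hypothetical numbers: a 1 Gbps client writing a 1 GB file.
    for replicas in (2, 3):
        sent = client_bytes_sent(1_000_000_000, replicas)
        rate = effective_write_gbps(1.0, replicas)
        print(f"replica {replicas}: client sends {sent} bytes, at most ~{rate:.2f} Gbps of writes")
```

For a file sharing site this mostly affects uploads. Downloads only need to be read from one copy, so adding nodes should spread the read load, but measure this on your own hardware rather than relying on the arithmetic.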

The following is a good step-by-step guide on how to set it up: https://www.digitalocean.com/community/tutorials/how-to-create-a-redundant-storage-pool-using-glusterfs-on-ubuntu-servers

The Gluster documentation is also worth reading: https://gluster.readthedocs.io/en/latest/

Spooler
  • Overall a good idea, though it can get ... a little hairy if you want to replicate over long distances. – Michael Hampton Oct 05 '16 at 04:10
  • YES, as all you're using for geo-replication in Gluster currently is some creative rsync. Not exactly the most timely and efficient method. – Spooler Oct 05 '16 at 04:20
  • @MichaelHampton Be aware that GlusterFS is **not** meant to replicate in a read/write fashion over WAN links (i.e. high latency and low reliability). To replicate over long distances you have to use geo-replication, which is a read-only replication done via rsync. – shodanshok Oct 05 '16 at 06:12
  • You definitely want to cover this, since someone asking this question _will_ want georeplication, either now, or in the near future. – Michael Hampton Oct 05 '16 at 08:24