
I am writing an abstraction layer that will abstract the back-end implementation of a (yet to be decided) distributed file system.

Possible choices for the file system are HDFS, GlusterFS, Ceph, etc.

The front end will be SOAP/REST services.

The abstraction layer will receive a stream of data from the web services and send it to the back-end distributed file system.

File sizes will be multiple gigabytes.
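
For illustration, here is a rough sketch of the kind of contract I have in mind for the abstraction layer. The interface and type names are placeholders I made up for this question, not an existing API:

```java
import java.io.InputStream;

/**
 * Hypothetical contract for the abstraction layer: the web-service front end
 * hands over an InputStream, and a back-end-specific implementation streams
 * it into HDFS, GlusterFS, Ceph, etc.
 */
public interface DistributedFileStore {

    /** Stream the given data into the back-end file system under the given logical path. */
    void write(String logicalPath, InputStream data) throws StorageException;

    /** Open the stored object for reading as a stream. */
    InputStream read(String logicalPath) throws StorageException;
}

/** Placeholder exception type wrapping back-end-specific failures. */
class StorageException extends Exception {
    public StorageException(String message, Throwable cause) {
        super(message, cause);
    }
}
```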

My question:

What is the best approach to push data into the distributed file system if we need maximum throughput, no loss of data, and want to leverage the distributed nature of the back-end file system?
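
As one example of what I am considering, an HDFS-backed implementation might push the stream in roughly like this, using the standard org.apache.hadoop.fs.FileSystem API. This is only a sketch; the namenode address, buffer size, replication factor, and block size are example values:

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamWriter {

    /**
     * Copy an incoming stream straight into HDFS without buffering the whole
     * file in memory. HDFS splits the file into blocks and replicates them
     * across data nodes, which is where the distributed throughput comes from.
     */
    public void write(String hdfsPath, InputStream data) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // example namenode address

        try (FileSystem fs = FileSystem.get(conf)) {
            int bufferSize = 4 * 1024 * 1024;      // 4 MB copy buffer (example value)
            short replication = 3;                 // example replication factor
            long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block size (example value)

            try (FSDataOutputStream out =
                     fs.create(new Path(hdfsPath), true, bufferSize, replication, blockSize)) {
                IOUtils.copyBytes(data, out, bufferSize, false);
            }
        }
    }
}
```

The point of this sketch is that the incoming stream is copied through a fixed-size buffer rather than being loaded into memory, which matters for multi-gigabyte files.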

Yogesh Devi
  • Generally speaking, each of these distributed file systems could be used as a high-throughput data sink, but to really answer your question correctly you need to provide information on what guarantees you need the system to make. For instance, is strict ordering of inserts required? There are already systems like Kafka that create a log abstraction for ingesting streams, but they don't provide strict global ordering. – Noah Watkins Feb 03 '14 at 16:28
  • Noah, I edited the description to talk about the needed characteristics of the abstraction layer; please check that out. And BTW, a big thanks for pointing me to Kafka. That is interesting; however, I wonder how it will work for gigabyte-sized files. – Yogesh Devi Feb 05 '14 at 10:17
  • Kafka's message.max.bytes is 1000000 by default. The Kafka consumer does not support streaming a message and has to allocate memory to be able to read the largest message. So Kafka is not an option! – Yogesh Devi Feb 05 '14 at 10:58
  • Do you need POSIX semantics for the files, or does the data look more like immutable key/value pairs? Something like Ceph's RADOSGW provides an Amazon S3-compatible interface. – Noah Watkins Feb 05 '14 at 15:43
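
Regarding Noah's RADOSGW suggestion above: if the data can be treated as immutable objects rather than POSIX files, an upload through the S3-compatible interface using the AWS SDK for Java might look roughly like the sketch below. The endpoint, region, bucket name, and credentials are placeholders, not values from a real setup:

```java
import java.io.InputStream;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class RadosGatewayUploader {

    /** Upload a stream of known length to an S3-compatible endpoint (e.g. Ceph RADOSGW). */
    public void upload(String key, InputStream data, long contentLength) throws InterruptedException {
        // Placeholder endpoint, region, and credentials for a RADOSGW installation.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration("http://radosgw.example.com", "default"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                .withPathStyleAccessEnabled(true)
                .build();

        // TransferManager turns large objects into multipart uploads automatically,
        // so multi-gigabyte files are sent as parallel parts instead of one request.
        TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3).build();
        try {
            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(contentLength); // avoids the SDK buffering the whole stream
            tm.upload("my-bucket", key, data, metadata).waitForCompletion();
        } finally {
            tm.shutdownNow();
        }
    }
}
```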

0 Answers