
We are trying to move from our datacenter to Google Compute Engine. While we understand how to set up instances and deploy the workload, we are not sure what the best equivalent is for storing data. We receive data once every day, and studies run all day on every server against the data received over the last 1-2 years. Any pointers?

Humble Debugger
  • Could you clarify your question a bit more? What kinds of data, and how are you using it? You can run standard services you're familiar with like webservers, sftp, etc, or you can look at specific Google technologies to help augment your processing needs. – shollyman Jun 18 '13 at 02:10
  • We typically receive binary data in fixed structs, and we are well versed in processing it. We have been operating on it in the traditional setting of a file server shared across a number of compute servers in the datacenter. We are trying to move this process to the cloud; that part is new to us. – Humble Debugger Jun 19 '13 at 15:11

2 Answers

It sounds like you're looking for a shared file server like NFS. You can run an NFS server on a single GCE instance to distribute the data to your various computation nodes. The Linux Documentation Project has a reasonable howto.
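As a rough sketch of what that setup involves (assuming Debian-based images; the instance name `nfs-server`, the export path, and the network range are all placeholders for your own values):

```shell
# On the file-server instance: install the NFS server and export the data directory.
sudo apt-get install -y nfs-kernel-server
echo '/srv/data 10.240.0.0/16(ro,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra   # re-read /etc/exports and apply the export

# On each compute node: install the client and mount the share.
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/data
sudo mount -t nfs nfs-server:/srv/data /mnt/data
```

The howto linked above covers tuning (`rsize`/`wsize`, `async` vs `sync`, and so on) once the basic mount works.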

Another option is to use an object store like Google Cloud Storage, which allows you to store blobs of binary data under various names (a bit like a cloud filesystem). If your software needs to use standard filesystem commands to access the data, a FUSE filesystem like s3fuse can be used to export a Google Storage bucket as a set of files and directories on each machine.
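For a sense of the object-store workflow, the `gsutil` tool (part of the Cloud SDK) covers the basic operations; bucket and file names below are placeholders:

```shell
# One-time: create a bucket to hold the data.
gsutil mb gs://my-research-data

# Daily: upload the day's binary feed.
gsutil cp feed-20130618.bin gs://my-research-data/daily/

# On a compute node: list and fetch objects as needed.
gsutil ls gs://my-research-data/daily/
gsutil cp gs://my-research-data/daily/feed-20130618.bin .
```

A FUSE layer like s3fuse replaces the explicit `gsutil cp` steps with ordinary filesystem reads against a mounted directory.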

How to choose between the two options:

  1. If you're already using NFS, you might be more comfortable continuing with the same configuration you have onsite. If not, I'd suggest giving s3fuse and GCS a try.
  2. If you run your own NFS server, you'll be responsible for any backups and so forth that you might need. Google Cloud Storage replicates your data across multiple sites, so even if one site is down for maintenance, you can still read and write your data.
  3. FUSE filesystems like s3fuse tend to support read and write operations, but may not support complex locking behavior or the like that NFS does.
  4. You may be charged for the number of reads and writes you do to data stored in GCS. (I don't recall; I think network traffic to/from GCS from GCE is free.) If you choose to run your own NFS server, you'll have to pay for the running instance and the persistent disk, as well as the read and write operations to the disk.

You might also be interested in this other Stack Overflow question, which covers some of the same ground: Storage options for diskless servers

E. Anderson

Just to chime in and echo E. Anderson's answer: if you're already using NFS and thinking about s3fuse, you might also want to take a look at gcsfuse, which does a similar job but (I'm told) has better performance against GCS (see the gcsfuse docs for all the extra technical details).
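Usage is minimal; assuming gcsfuse is installed and the instance's service account can access the (placeholder) bucket below:

```shell
# Mount the bucket as a directory.
mkdir -p /mnt/research-data
gcsfuse my-research-data /mnt/research-data

# Objects in the bucket now appear as ordinary files.
ls /mnt/research-data

# Unmount when finished.
fusermount -u /mnt/research-data
```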

JJ Geewax