
I have a rather small dataset (~5 GB, 1.5 million rows), currently stored in Bigtable and accessed through the HBase API (Scala), for the purpose of doing data analytics with Spark (Dataproc).
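For context, reading such a table from Spark through the HBase API looks roughly like the sketch below; the table name is a placeholder, and the Bigtable-specific HBase connector settings and dependencies are assumed to already be in place on the Dataproc cluster.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bigtable-analytics").getOrCreate()

    // Standard HBase-over-Hadoop input; the Bigtable HBase client is assumed
    // to be configured (hbase-site.xml / connector jars) on the cluster.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my-table") // hypothetical table name

    // Each record comes back as a (row key, Result) pair straight off the table.
    val rows = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rows.count())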

However, I'm also on a tight budget, and the Bigtable cost is rather high (~2 USD/hour), so what I ended up doing is deleting and recreating the Bigtable cluster whenever I need it.

The obvious drawback is that it takes quite a while to populate a fresh cluster, due to the nature of my data: it's all stored in a single big text file as JSON, and it takes ~40 minutes to populate the cluster.
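For reference, a populate job of that shape can be sketched in Spark as below; the GCS path, table name, column family, and row-key scheme are illustrative assumptions, not the actual ones.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bigtable-load").getOrCreate()

    val conf = HBaseConfiguration.create()
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my-table") // hypothetical table name
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Read the single big JSON text file and write one cell per line.
    spark.sparkContext.textFile("gs://my-bucket/data.json") // hypothetical path
      .zipWithUniqueId()                                    // placeholder row keys
      .map { case (json, id) =>
        val put = new Put(Bytes.toBytes(id))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(json))
        (new ImmutableBytesWritable, put)
      }
      .saveAsNewAPIHadoopDataset(job.getConfiguration)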

So what I'm asking is whether there is a better way to do this, like implementing some kind of backup/snapshot routine? Or perhaps not using Bigtable at all. I couldn't find any other HDFS alternatives on the Google Cloud Platform.

It should be noted that I'm rather new to the world of Hadoop and big data, so excuse my ignorance if I'm missing the obvious.

habitats

3 Answers

1

First off, if you haven't seen it, we show how to use Cloud Bigtable with Dataproc. It should be easy to spin up a job to populate your Bigtable quickly if that's what you wish.

Bigtable is really designed for databases of 1 TB or larger. At 5 GB, you might wish to consider Memcache or Redis instead. With Redis, you would only need to load your data once; then you could keep the disk around when you spin down your instance / cluster.

1

Additionally (if it fits your use case and you don't need the database aspects of Bigtable), you can run Hadoop or Spark jobs (using Google Cloud Dataproc if you'd like) directly over files in Google Cloud Storage, which will be substantially cheaper than storing the data in Bigtable.

See Google Cloud Storage connector for more info.
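A minimal sketch of that approach, assuming the GCS connector that Dataproc ships with and a hypothetical bucket, file, and column name:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("gcs-analytics").getOrCreate()

    // The GCS connector lets Spark treat gs:// paths like any other filesystem,
    // so the JSON can be analyzed without loading it into Bigtable at all.
    val df = spark.read.json("gs://my-bucket/data.json") // hypothetical path
    df.printSchema()
    df.groupBy("someColumn").count().show()              // hypothetical column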

Misha Brukman
Max
1

Consider importing the JSON once, and then exporting the data to sequence files via Hadoop, as described here: https://cloud.google.com/bigtable/docs/exporting-importing. The sequence file format that Hadoop uses is likely to be more efficient than JSON.
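That page describes Bigtable's own export/import tooling; as a rough Spark-side sketch of the same idea (paths are assumptions), the parsed data could also be written to sequence files on GCS once and read back on later runs instead of re-parsing the JSON:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("seqfile-cache").getOrCreate()
    val sc = spark.sparkContext

    // One-time conversion: JSON lines -> (key, json) pairs -> sequence files.
    sc.textFile("gs://my-bucket/data.json")           // hypothetical path
      .zipWithUniqueId()
      .map { case (json, id) => (id, json) }
      .saveAsSequenceFile("gs://my-bucket/data-seq")  // hypothetical path

    // Later runs: read the sequence files directly, skipping the JSON parse.
    val rows = sc.sequenceFile[Long, String]("gs://my-bucket/data-seq")
    println(rows.count())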

Solomon Duskis