
I'm trying to use a Dataproc cluster to import large CSV files into HDFS, export them to SequenceFile format, and finally import those SequenceFiles into Bigtable as described here: https://cloud.google.com/bigtable/docs/exporting-importing

I initially imported the CSV files as an external table in Hive, then exported them by inserting the rows into a SequenceFile-backed table.

However (probably because Dataproc seems to ship with Hive 1.0), I hit the cast exception described here: Bigtable import error

I can't seem to get the HBase shell or ZooKeeper up and running on the Dataproc master VM, so I can't run a simple export job from the CLI.

  1. Is there an alternative way to export Bigtable-compatible SequenceFiles from Dataproc?

  2. What is the proper configuration for getting HBase and ZooKeeper running on the Dataproc master node?

mssch

1 Answer


The import instructions you linked to are for importing data from an existing HBase deployment.

If your input format is CSV, creating SequenceFiles is probably an unnecessary step. How about writing a Hadoop MapReduce job that processes the CSV files and writes directly to Cloud Bigtable? A Dataflow pipeline would also be a good fit here.

Take a look at samples here: https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/java
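
To make the idea concrete, here is a minimal sketch of the per-row mapping such a job would perform: each CSV record becomes one row key plus one cell per remaining column, so a multi-column CSV ends up in multiple Bigtable columns. This is plain Python for illustration only and does not call any Bigtable or HBase API; the function name `csv_to_cells`, the `key_column` parameter, and the column family name `cf` are all assumptions. In an HBase-style job, each emitted tuple would correspond to one column added to the row's `Put`.

```python
import csv
import io


def csv_to_cells(csv_text, key_column, family="cf"):
    """Yield (row_key, family, qualifier, value) byte tuples.

    Hypothetical helper: the first CSV line is treated as the header,
    `key_column` names the field used as the row key, and every other
    field becomes its own column qualifier in the (assumed) family `cf`.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        row_key = record[key_column]
        for qualifier, value in record.items():
            # Skip the key field itself, plus missing or surplus fields
            # that DictReader reports as None.
            if qualifier == key_column or qualifier is None or value is None:
                continue
            yield (
                row_key.encode("utf-8"),
                family.encode("utf-8"),
                qualifier.encode("utf-8"),
                value.encode("utf-8"),
            )


if __name__ == "__main__":
    sample = "id,name,city\nu1,Ada,London\nu2,Linus,Helsinki\n"
    for cell in csv_to_cells(sample, key_column="id"):
        print(cell)
```

Keeping the mapping as a pure function like this makes it easy to test separately from the cluster and client configuration.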

Max
  • Thanks. I ended up figuring this out and started working on an MR job as suggested. It does puzzle me, though, that Dataproc does not ship with Bigtable support built in (I had to install the libs and set up HBase myself). Plus, I have run into several ZooKeeper-related issues while trying to submit Hadoop jobs locally... Any plan to merge Dataproc with bdutil soon? Should I use only the latter for the time being? – mssch Oct 02 '15 at 10:03
  • I can't speak to specific timelines right now, but it's definitely a goal to get all of the pieces of our Big Data ecosystem integrated, and doing so with minimal developer friction. Stay tuned! – Max Oct 02 '15 at 17:41
  • An update to Hive for Dataproc is in the works - it should ship with Dataproc in the next few weeks. – James Oct 13 '15 at 20:03
  • @Max The code here writes CSV data with multiple columns into a single Bigtable column. Can you provide a way to resolve this? – Aman Vaishya Jul 05 '17 at 08:07