3

I have a large CSV dataset (>5TB) split across multiple files in a storage bucket that I need to import into Google Bigtable. The files are in the format:

rowkey,s1,s2,s3,s4
text,int,int,int,int
...

There is an importtsv tool for HBase that would be perfect, but it does not seem to be available when using the Google HBase shell on Windows. Is it possible to use this tool? If not, what is the fastest way of achieving this? I have little experience with HBase and Google Cloud, so a simple example would be great. I have seen some similar examples using Dataflow but would prefer not to learn how to do this unless necessary.

Thanks

mattrix

2 Answers

7

The ideal way to import something this large into Cloud Bigtable is to put your TSV on Google Cloud Storage.

  • gsutil mb gs://<your-bucket-name>
  • gsutil -m cp -r <source dir> gs://<your-bucket-name>/

Then use Cloud Dataflow.

  1. Use the HBase shell to create the destination table and its column family (individual columns don't need to be declared up front); see the shell sketch just after this list.

  2. Write a small Dataflow job to read all the files, build a row key for each line, and write the rows to the table. (See this example to get started.)
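
For step 1, a minimal shell sketch (untested; the table name my-table and the column family cf are placeholders for whatever your Dataflow job will write to) could be:

    # Pipe the create statement into the HBase shell; only the column family
    # needs to exist up front, since individual columns are created on write.
    echo "create 'my-table', 'cf'" | hbase shell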

A somewhat easier way would be to (note: untested):

  • Copy your files to Google Cloud Storage
  • Use Google Cloud Dataproc; the example shows how to create a cluster and hook up Cloud Bigtable (a rough gcloud sketch follows this list).
  • ssh to your cluster master (the script in the wordcount-mapreduce example accepts ./cluster ssh).
  • Use the HBase ImportTsv tool to start a MapReduce job (the command below assumes a column family named cf and comma-separated input):

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:s1,cf:s2,cf:s3,cf:s4 <tablename> gs://<your-bucket-name>/<dir>/**
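
A rough, untested gcloud sketch of the cluster bullets above; my-cluster is a placeholder name, and the referenced example's own scripts (not these two commands) take care of wiring the Bigtable client onto the cluster:

    # Create a small Dataproc cluster for the import job.
    gcloud dataproc clusters create my-cluster --num-workers=4

    # SSH to the cluster master (Dataproc names the master <cluster-name>-m).
    gcloud compute ssh my-cluster-m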

  • Thank you Les. I was able to get the second option to work. May I ask when the hdd option for Bigtable will be available? – mattrix Dec 14 '15 at 22:45
  • It's in progress -- should be mid to late Q1. – Les Vogel - Google DevRel Dec 14 '15 at 23:57
  • BIG warning for people using this technique. After wasting LOTS of money (Google - want to give me a refund given no errors were shown? :)), note that if you use importtsv with text files from Google storage that are compressed (i.e. uploaded using gsutil -z), importtsv reports the import as successful but actually stops importing after a certain amount of data in each file. I found this out after importing over 4 TB of data and had to restart because I only tested with smaller files :( Using uncompressed text data seems (so far) to work fine. – mattrix May 25 '16 at 09:46
  • @mattrix – sorry about the inconvenience, and thanks for the details. Please get in touch with me, I'm the PM for Cloud Bigtable. You can find me on Twitter, LinkedIn, GitHub, Slack, etc. – Misha Brukman Feb 14 '17 at 05:06
  • I followed the steps for using Dataflow and was able to get this to work. I wrote up a more detailed explanation and have a Dataflow job on Github you can use to make this easier. Check it out here: https://cloud.google.com/community/tutorials/cbt-import-csv – Billy Jacobson Jul 27 '18 at 18:27
0

I created a bug on the Cloud Bigtable client project to implement an importtsv equivalent.

Even if we can get importtsv to work, setting up the Bigtable tooling on your own machine may take some doing. Importing files this big is really a job for a distributed framework (Hadoop or Dataflow) rather than a single machine, so I'm not sure how well running the import from your machine would work.

Misha Brukman
Solomon Duskis
  • Thanks for your response. I think you're clearing up some of my confusion - I'm not used to this environment at all. I was hoping that you could run importtsv on Google Cloud, not a local machine, using the cluster (3 nodes). Does this require a distributed job? Ideally it would be using the Google Cloud Shell from the developers console but, if not, using the Google hbase shell. – mattrix Dec 06 '15 at 00:54