
Has anybody succeeded in loading data from Bigtable via Pig on Dataproc using HBaseStorage?

Here's a very simple Pig script I'm trying to run. It fails with an error indicating it can't find the BigtableConnection class, and I'm wondering what setup I'm missing to successfully load data from Bigtable.

raw = LOAD 'hbase://my_hbase_table'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       'cf:*', '-minTimestamp 1490104800000 -maxTimestamp 1490105100000 -loadKey true -limit 5')
       AS (key:chararray, data);

DUMP raw;

Steps I followed to set up my cluster:

  1. Launched Bigtable cluster (my_bt); created and populated my_hbase_table
  2. Launched Dataproc cluster (my_dp) via cloud.google.com Cloud Dataproc Console
  3. Installed HBase shell on Dataproc master (/opt/hbase-1.2.1) following instructions on https://cloud.google.com/bigtable/docs/installing-hbase-shell
  4. Added properties to hbase-site.xml for my_bt and the BigtableConnection class (sketched after the error output below)
  5. Created file t.pig with contents listed above
  6. Invoked Pig via command: gcloud beta dataproc jobs submit pig --cluster my_dp --file t.pig --jars /opt/hbase-1.2.1/lib/bigtable/bigtable-hbase-1.2-0.9.5.1.jar
  7. Got the following error indicating the BigtableConnection class was not found:

2017-03-21 15:30:48,029 [JobControl] ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat - java.io.IOException: java.lang.ClassNotFoundException: com.google.cloud.bigtable.hbase1_2.BigtableConnection
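
For reference, the properties I added to hbase-site.xml in step 4 look roughly like this, written as a shell heredoc (a sketch: the property names are the same ones used elsewhere in this thread, the conf path follows the step 3 install location, and MY_PROJECT is a placeholder for my actual project ID):

# Sketch: write a minimal hbase-site.xml pointing the HBase client at Bigtable.
# MY_PROJECT is a placeholder; my_bt is the Bigtable instance from step 1.
cat > /opt/hbase-1.2.1/conf/hbase-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase1_2.BigtableConnection</value>
  </property>
  <property>
    <name>google.bigtable.project.id</name>
    <value>MY_PROJECT</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>my_bt</value>
  </property>
</configuration>
EOF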

  • I would suggest using the shaded bigtable mapreduce jar, which has all of the dependencies you'll need. Go to http://search.maven.org/#search%7Cga%7C1%7Cbigtable%20mapreduce and download "shaded.jar". – Solomon Duskis Mar 22 '17 at 13:12
  • I downloaded the shaded.jar and got the same error when submitting the Pig job. I can upload the output I get when running the test if that helps. – EduBoom Mar 22 '17 at 14:58
  • Can you try adding netty-tcnative-boringssl-static? See http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22io.netty%22%20AND%20a%3A%22netty-tcnative-boringssl-static%22%20AND%20v%3A%221.1.33.Fork26%22 and download the "jar". – Solomon Duskis Mar 22 '17 at 22:41

1 Answer


The trick is getting all the dependencies onto Pig's classpath. Using the jars Solomon pointed to, I've created the following initialization action, which downloads two jars, the shaded bigtable mapreduce jar and netty-tcnative-boringssl-static, and sets up the Pig classpath (a usage sketch follows the script).

#!/bin/bash
# Initialization action to set up pig for use with cloud bigtable
mkdir -p /opt/pig/lib/

curl http://repo1.maven.org/maven2/io/netty/netty-tcnative-boringssl-static/1.1.33.Fork19/netty-tcnative-boringssl-static-1.1.33.Fork19.jar \
    -f -o /opt/pig/lib/netty-tcnative-boringssl-static-1.1.33.Fork19.jar

curl http://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-mapreduce/0.9.5.1/bigtable-hbase-mapreduce-0.9.5.1-shaded.jar \
    -f -o /opt/pig/lib/bigtable-hbase-mapreduce-0.9.5.1-shaded.jar

cat >>/etc/pig/conf/pig-env.sh <<EOF
#!/bin/bash

for f in /opt/pig/lib/*.jar; do
  if [ -z "\${PIG_CLASSPATH}" ]; then
    export PIG_CLASSPATH="\${f}"
  else
    export PIG_CLASSPATH="\${PIG_CLASSPATH}:\${f}"
  fi
done
EOF
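
To use the initialization action, upload it to a GCS bucket and reference it when creating the cluster (a sketch; gs://MY_BUCKET and the script name are placeholders for your own bucket and filename):

# Upload the script, then create the cluster with the init action applied.
gsutil cp pig-bigtable-init.sh gs://MY_BUCKET/pig-bigtable-init.sh
gcloud dataproc clusters create my_dp \
    --initialization-actions gs://MY_BUCKET/pig-bigtable-init.sh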

You can then pass in Bigtable configuration in the usual ways:

  • Via hbase-site.xml
  • Specifying properties when submitting a job:

    PROPERTIES='hbase.client.connection.impl='
    PROPERTIES+='com.google.cloud.bigtable.hbase1_2.BigtableConnection'
    PROPERTIES+=',google.bigtable.instance.id=MY_INSTANCE'
    PROPERTIES+=',google.bigtable.project.id=MY_PROJECT'
    
    gcloud dataproc jobs submit pig --cluster MY_DATAPROC_CLUSTER \
        --properties="${PROPERTIES}"  \
        -e "f =  LOAD 'hbase://MY_TABLE' 
             USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*','-loadKey true') 
             AS (key:chararray, data); 
        DUMP f;"
    
  • Thanks. I'll give it a try :) – EduBoom Mar 23 '17 at 14:21
  • The addition of pig-env.sh did the trick. But HBaseStorage has options not supported by the Bigtable client API. I got no results with the min/max timestamp options, but got results with -gte. It seems -lt is not supported. HBaseStorage uses RowFilter to implement -gte and -lt, but Bigtable's implementation of RowFilter does not support that. What we actually use in our Pig jobs is a custom loader that creates Scan objects and calls setStartRow() and setStopRow(). I don't know if those are supported by Bigtable. I'll have to experiment. Thank you for your help. Eduardo. – EduBoom Mar 23 '17 at 18:46
  • setStartRow() and setStopRow() are indeed supported. Feel free to raise a github issue in the Cloud Bigtable client library about the RowFilter issue at https://github.com/GoogleCloudPlatform/cloud-bigtable-client – Solomon Duskis Mar 28 '17 at 01:12
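
Following up on the comment thread: since -gte reportedly works against Bigtable while -lt does not, a lower-bounded row-key scan can be expressed directly in HBaseStorage's options (a sketch reusing the PROPERTIES variable from the answer; MY_TABLE and the row_000 start key are placeholders):

gcloud dataproc jobs submit pig --cluster MY_DATAPROC_CLUSTER \
    --properties="${PROPERTIES}" \
    -e "f = LOAD 'hbase://MY_TABLE'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true -gte row_000')
         AS (key:chararray, data);
    DUMP f;"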