
I have an easy task: I want to read HBase data in a Kerberos secured cluster. So far I tried 2 approaches:

  • sc.newAPIHadoopRDD(): here I don't know how to handle the Kerberos authentication
  • create an HBase connection from the HBase API: here I don't really know how to convert the result into RDDs

Furthermore, there seem to be some HBase-Spark connectors. But somehow I didn't manage to find them as Maven artifacts, and/or they require a fixed structure of the result (but I just need the HBase Result object, since the columns in my data are not fixed).

Do you have any example or tutorials or ....? I appreciate any help and hints.

Thanks in advance!


1 Answer


I assume that you are using Spark + Scala + HBase.

import org.apache.spark._
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor}
import org.apache.hadoop.hbase.client.{HBaseAdmin, HTable, Put}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.security.UserGroupInformation


object SparkWithMyTable {
  def main(args: Array[String]) {
    //Initiate spark context with spark master URL. You can modify the URL per your environment. 
    val sc = new SparkContext("spark://ip:port", "MyTableTest")

    val tableName = "myTable" 

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "list of cluster ip's")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.master", "masterIP:60000")

    // Kerberos settings
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("hbase.security.authentication", "kerberos")

    // log in from the keytab (keyTabPath points to your keytab file)
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab("user@---", keyTabPath)

    // Add local HBase conf
   // conf.addResource(new Path("file://hbase/hbase-0.94.17/conf/hbase-site.xml"))
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    // create my table with column family
    val admin = new HBaseAdmin(conf)
    if(!admin.isTableAvailable(tableName)) {
      print("Creating MyTable")
      val tableDesc = new HTableDescriptor(tableName)
      tableDesc.addFamily(new HColumnDescriptor("cf1".getBytes()));
      admin.createTable(tableDesc)
    } else {
      print("Table already exists!!")
      // only add the column family if it is not already there
      val existingDesc = admin.getTableDescriptor(Bytes.toBytes(tableName))
      if (!existingDesc.hasFamily(Bytes.toBytes("cf1"))) {
        admin.disableTable(Bytes.toBytes(tableName))
        admin.addColumn(tableName, new HColumnDescriptor("cf1"))
        admin.enableTable(Bytes.toBytes(tableName))
      }
    }

    // first put some data into the table
    val myTable = new HTable(conf, tableName)
    for (i <- 0 to 5) {
      val p = new Put(Bytes.toBytes("row" + i))
      p.add(Bytes.toBytes("cf1"), Bytes.toBytes("column-1"), Bytes.toBytes("value " + i))
      myTable.put(p)
    }
    myTable.flushCommits()
    
    // create an RDD of (ImmutableBytesWritable, Result) pairs from the table
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], 
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    //get the row count
    val count = hBaseRDD.count()
    print("HBase RDD count:"+count)
    System.exit(0)
  }
}
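
Since your columns are not fixed, you don't have to map the rows into a fixed schema; you can work directly with the Result objects in the RDD. A minimal sketch continuing from the hBaseRDD above (the name rowsAsMaps is just illustrative, and it assumes an HBase client version that has Result.rawCells() and CellUtil):

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

// turn each (rowkey, Result) pair into (rowkey, Map[qualifier -> value])
// without assuming a fixed set of columns
val rowsAsMaps = hBaseRDD.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get())
  val cells = result.rawCells().map { cell =>
    Bytes.toString(CellUtil.cloneQualifier(cell)) -> Bytes.toString(CellUtil.cloneValue(cell))
  }.toMap
  (rowKey, cells)
}
rowsAsMaps.take(5).foreach(println)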

Maven Artifact

<dependency>
  <groupId>it.nerdammer.bigdata</groupId>
  <artifactId>spark-hbase-connector_2.10</artifactId>
  <!-- version can be changed per your Spark version; I am using Spark 1.6.x -->
  <version>1.0.3</version>
</dependency>
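
For completeness, reading with this connector looks roughly like the sketch below (written from memory of the project's README, so please verify against the project page; table, column, and family names are placeholders). Note that, as you said, it does expect a fixed tuple shape, with the row key as the first element:

import it.nerdammer.spark.hbase._

// the connector locates HBase via the spark.hbase.host property, e.g.
// sparkConf.set("spark.hbase.host", "zookeeper-host")
val connectorRDD = sc.hbaseTable[(String, String)]("myTable")
  .select("column-1")
  .inColumnFamily("cf1")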

Can also have a look at

  • Hi, thanks for your hints. Creating an RDD as mentioned above does not work. It still has issues with the authentication - UserGroupInformation.loginUserFromKeytab("user@---", keyTabPath); only seems to be executed on the master but not on the worker nodes. – Daniel Oct 11 '16 at 07:23
  • Sorry - I accidentally hit "Add Comment" too early. My problem is still not solved - see my initial comment. – Daniel Oct 11 '16 at 07:28
  • Hi, are you using yarn-cluster mode? – Ram Ghadiyaram Oct 11 '16 at 07:30
  • then this could help you. https://community.hortonworks.com/questions/46500/spark-cant-connect-to-hbase-using-kerberos-in-clus.html – Ram Ghadiyaram Oct 11 '16 at 07:32
  • I tried both yarn-client and yarn-cluster. The link to the Hortonworks forum also does not really help since it is using the native HBase API instead of sc.newAPIHadoopRDD... – Daniel Oct 11 '16 at 07:38
  • Have you seen SPARK-6918 (https://issues.apache.org/jira/browse/SPARK-6918)? – Ram Ghadiyaram Oct 11 '16 at 07:59
  • Whatever the result may be, keep me posted. Intrigued to know how it's working for you. – Ram Ghadiyaram Oct 11 '16 at 08:49
  • No. But I used a (dirty) workaround and could make the data accessible through an external Hive table and process it from there. – Daniel Oct 25 '16 at 08:48