
I have to put multiple column families from a table in HBase into one Spark RDD. I am attempting this using the following code (question edited after the first answer):

import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable    
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._
object HBaseRead {
   def main(args: Array[String]) {
     val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local").set("spark.driver.allowMultipleContexts","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     val sc = new SparkContext(sparkConf)        
     val conf = HBaseConfiguration.create()  
     val tableName = "TableName"  

     ////setting up required stuff 
     System.setProperty("user.name", "hdfs")        
     System.setProperty("HADOOP_USER_NAME", "hdfs")
     conf.set("hbase.master", "localhost:60000")
     conf.setInt("timeout", 120000)
     conf.set("hbase.zookeeper.quorum", "localhost")
     conf.set("zookeeper.znode.parent", "/hbase-unsecure")
     conf.set(TableInputFormat.INPUT_TABLE, tableName)
     sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result])) 
     val admin = new HBaseAdmin(conf)
     if (!admin.isTableAvailable(tableName)) {
          val tableDesc = new HTableDescriptor(tableName)
          admin.createTable(tableDesc)
     }
     case class Model(Shoes: String, Clothes: String, `T-shirts`: String) // backticks needed: T-shirts is not a plain Scala identifier
     var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
     val transformedRDD = hBaseRDD2.map(tuple => {
         val result = tuple._2
         Model(Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("Shoes"))),
         Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("Clothes"))),
         Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("T-shirts")))
         )
     })
     val totalcount = transformedRDD.count()
     println(totalcount)
   }
}

What I want to do is build a single RDD in which the values of the first row (and subsequent rows later on) from these column families are combined into a single array. Any help would be appreciated. Thanks

1 Answer

You can do it a couple of ways: inside the RDD's map you can read all of the columns from the parent RDD (hBaseRDD2), transform them, and return them as a single RDD.
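
For example, a minimal sketch of that first approach, assuming the "Category" family and the qualifiers from the question; it returns each row as an Array[String]:

// A sketch only: collects the wanted cells into one array per row.
// Family and qualifier names are taken from the question; adjust them to your schema.
val arrayRDD = hBaseRDD2.map { case (_, result) =>
  Array("Shoes", "Clothes", "T-shirts").map { qualifier =>
    Bytes.toString(result.getValue(Bytes.toBytes("Category"), Bytes.toBytes(qualifier)))
  }
}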

Or you can create a case class and map the columns to it.

For example:

case class Model(column1: String,
                 column2: String,
                 column3: String)

var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
    val result = tuple._2
    Model(Bytes.toString(result.getValue(Bytes.toBytes("cf1"),Bytes.toBytes("Columnname1"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf2"),Bytes.toBytes("Columnname2"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf2"),Bytes.toBytes("Columnname2")))
    )
})
  • Hey, thanks! Could you point me to some online sources where I can read about this? Also, where can I find the other methods? This solution is throwing the following error: org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1446) at org.apache.spark.rdd.RDD.map(RDD.scala:286) – Ravi Ranjan Oct 25 '16 at 14:40
  • @RaviRanjan: post the complete code of what you are trying; the above code should work without any issue. You can read about Spark transformations and actions at http://spark.apache.org/docs/latest/programming-guide.html#transformations – Shankar Oct 25 '16 at 14:47
  • Kindly see: I have made the edits in the question itself. – Ravi Ranjan Oct 25 '16 at 15:23
  • @RaviRanjan: I already gave an answer for the same kind of task-not-serializable issue: http://stackoverflow.com/questions/40187793/writing-sparkrdd-to-a-hbase-table-using-scala – Shankar Oct 25 '16 at 15:32
  • @RaviRanjan: try doing the map operation as a separate method, not inside main, or you can create a Scala App instead of a Scala object and keep the same content you gave; it should work. – Shankar Oct 25 '16 at 15:33
  • @RaviRanjan: try implementing Serializable on your Scala object `HbaseRead` – Shankar Oct 25 '16 at 15:35
  • @RaviRanjan: Task not serializable is a famous error; what @Shankar means is `object/class Xyz extends` (not implements) `Serializable` – Ram Ghadiyaram Oct 25 '16 at 19:19
  • Doing `object HBaseRead extends Serializable {` worked for me (see the sketch after these comments). Thanks a lot. – Ravi Ranjan Oct 26 '16 at 11:41
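
For reference, a minimal sketch of the fix the last comment describes: only the object declaration changes, and the body of main stays as in the question.

// Extending Serializable lets Spark serialize the closure passed to map,
// which is what the "Task not serializable" error complains about.
object HBaseRead extends Serializable {
  def main(args: Array[String]) {
    // ... same configuration, case class and RDD code as in the question ...
  }
}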