I need to serialize an RDD read from HBase into the Alluxio in-memory file system, as a way to cache it and update it periodically for use in incremental Spark computations.

The code looks like this, but it fails with the exception in the title:

import alluxio.AlluxioURI
import alluxio.client.file.FileSystem
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val inputTableNameEvent = HBaseTables.TABLE_XXX.tableName
val namedeRDDName = "EventAllCached2Update"
val alluxioPath = "alluxio://hadoop1:19998/"
val fileURI = alluxioPath + namedeRDDName
val path: AlluxioURI = new AlluxioURI("/" + namedeRDDName)

val fs: FileSystem = FileSystem.Factory.get()

// Point the HBase input format at the source table
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, inputTableNameEvent)

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
                classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
                classOf[org.apache.hadoop.hbase.client.Result])
val numbers = rdd.count()
println("rdd count: " + numbers)

// Remove any previously cached copy before writing the new one
if (fs.exists(path))
  fs.delete(path)
rdd.saveAsObjectFile(fileURI)

Can anyone explain how to map ImmutableBytesWritable to another type to work around this problem? The mapping also needs to be reversible: later I need to use objectFile to read the saved object back and turn it into an RDD[(ImmutableBytesWritable, Result)] for further updates and computation.

dtolnay
bronzels

1 Answer

You need to convert the RDD into Row objects, something like the code below, and then save that to HDFS. The parsed RDD then behaves like any other RDD of plain data.

import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.Row

// Column family shared by every lookup below
val cf = Bytes.toBytes("CF")

// Drop the ImmutableBytesWritable key, then decode the row key and the
// latest value of each column as strings (null if a column is absent)
val parsedRDD = yourRDD.map(tuple => tuple._2).map(result =>
      Row(Bytes.toString(result.getRow),
      Bytes.toString(result.getValue(cf, Bytes.toBytes("column1"))),
      Bytes.toString(result.getValue(cf, Bytes.toBytes("column2"))),
      Bytes.toString(result.getValue(cf, Bytes.toBytes("column3"))),
      Bytes.toString(result.getValue(cf, Bytes.toBytes("column4"))),
      Bytes.toString(result.getValue(cf, Bytes.toBytes("column5")))
      ))
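
The Row mapping above is one-way. Since the question asks for a mapping that can be reversed, here is a rough sketch of an alternative: copy each cell into plain byte arrays before saving, and rebuild the Result after reading back. The names HBaseRddCache, PlainCell, PlainRow, toPlain, fromPlain, save and load are made up for illustration, and I am assuming every cell can be rebuilt as an ordinary Put-type KeyValue, so delete markers and tags would not survive the round trip.

```scala
import org.apache.hadoop.hbase.{Cell, CellUtil, KeyValue}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark, HBase or Alluxio
object HBaseRddCache {
  // (row, family, qualifier, timestamp, value): only byte arrays and a Long,
  // so Java serialization (used by saveAsObjectFile) can handle it
  type PlainCell = (Array[Byte], Array[Byte], Array[Byte], Long, Array[Byte])
  type PlainRow = (Array[Byte], Array[PlainCell])

  // Forward mapping: copy the key and every cell out of the HBase types
  def toPlain(key: ImmutableBytesWritable, result: Result): PlainRow = {
    val raw = Option(result.rawCells()).getOrElse(Array.empty[Cell])
    val cells = raw.map { c =>
      (CellUtil.cloneRow(c), CellUtil.cloneFamily(c), CellUtil.cloneQualifier(c),
        c.getTimestamp, CellUtil.cloneValue(c))
    }
    (key.copyBytes(), cells)
  }

  // Reverse mapping: rebuild the cells (as Put-type KeyValues) in their original order
  def fromPlain(row: PlainRow): (ImmutableBytesWritable, Result) = {
    val cells = new java.util.ArrayList[Cell](row._2.length)
    row._2.foreach { case (r, f, q, ts, v) => cells.add(new KeyValue(r, f, q, ts, v)) }
    (new ImmutableBytesWritable(row._1), Result.create(cells))
  }

  def save(rdd: RDD[(ImmutableBytesWritable, Result)], uri: String): Unit =
    rdd.map { case (k, r) => toPlain(k, r) }.saveAsObjectFile(uri)

  def load(sc: SparkContext, uri: String): RDD[(ImmutableBytesWritable, Result)] =
    sc.objectFile[PlainRow](uri).map(fromPlain)
}
```

With this, HBaseRddCache.save(rdd, fileURI) would take the place of rdd.saveAsObjectFile(fileURI) in the question, and HBaseRddCache.load(sc, fileURI) should give back an RDD of (ImmutableBytesWritable, Result) pairs. The rebuilt pairs are still not java.io.Serializable, so anything that shuffles or collects them would presumably need the plain form (or Kryo) again.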
Ramzy