10

I am trying to use HBase as a data source for Spark, so the first step is creating an RDD from an HBase table. Since Spark works with Hadoop input formats, I could find a way to create an RDD over all rows (http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase), but how do we create an RDD for a range scan?

All suggestions are welcome.

amitkarmakar

3 Answers

9

Here is an example of using Scan in Spark:

import java.io.{DataOutputStream, ByteArrayOutputStream}
import java.lang.String
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64

// Serialize the Scan so it can be passed through the job configuration.
// Note: Scan.write is Writable-based and only exists in HBase versions before 0.96;
// newer versions serialize the Scan with Protobuf (see the comments below).
def convertScanToString(scan: Scan): String = {
  val out: ByteArrayOutputStream = new ByteArrayOutputStream
  val dos: DataOutputStream = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}

val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)       // rows fetched per RPC; this does not limit the total number of rows returned
scan.setCacheBlocks(false) // recommended for full scans so the block cache isn't churned
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count

You need to add the related HBase libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can use `hbase classpath` to find them.
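
To actually read values out of the `Result` objects, you can map over the RDD. A minimal sketch (the column family `cf` and qualifier `q` are placeholders for your schema):

import org.apache.hadoop.hbase.util.Bytes

// rdd is a pair RDD of (ImmutableBytesWritable, Result); pull one column's value out of each row
val values = rdd.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q")))
}
values.take(10).foreach(println)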

zsxwing
    What version of HBase are you using for this? I can't find a `Scan.write` method in any of the versions I've scanned. – Ken Williams Mar 16 '15 at 19:07
  • HBase uses Protobuf now. You can find how to implement `convertScanToString` here: https://github.com/apache/hbase/blob/f7df0990c2d321cffd7ea2e20cb7b280d8cc9db6/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java#L510 – zsxwing Mar 17 '15 at 06:32
  • Will this setCaching(500) create an RDD with only 500 rows from HBase? I tried it, but it still gets all the data from HBase. – Laodao Dec 23 '15 at 21:38
  • No. The client will request 500 rows every time but still fetch all data. – zsxwing Dec 23 '15 at 22:11
  • In order to get the imports to work, I had to use `org.apache.hbase:hbase-client:1.1.2 org.apache.hbase:hbase-common:1.1.2 org.apache.hbase:hbase-server:1.1.2 ` – codeaperature Jan 10 '16 at 06:05
  • @zsxwing - Is there a possibility you can detail the convertScanToString part in Scala? – codeaperature Jan 10 '16 at 07:52
  • @StephanWarren this works: `def convertScanToString(scan: Scan): String = { val proto = ProtobufUtil.toScan(scan) Base64.encodeBytes(proto.toByteArray()) }` of course you need to `import org.apache.hadoop.hbase.protobuf.ProtobufUtil` first. – Alfredo Gimenez Mar 23 '16 at 00:30
  • @spiffman - Thanks - I figured this out with a bit of help. I found the following site useful - http://javatoscala.com/ - I should have updated this question/answer earlier. However, the next person with the same question can surely benefit from this knowledge. – codeaperature Mar 23 '16 at 06:49
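
For HBase 0.96 and later, where `Scan.write` no longer exists, the Protobuf-based variant mentioned in the comments looks roughly like this (a sketch against the HBase 1.x client API):

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64

// Serialize the Scan via Protobuf instead of the removed Writable-based Scan.write
def convertScanToString(scan: Scan): String = {
  val proto = ProtobufUtil.toScan(scan)
  Base64.encodeBytes(proto.toByteArray)
}
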
8

You can set the following conf:

 val conf = HBaseConfiguration.create() // also set any other required HBase params here
 conf.set(TableInputFormat.SCAN_ROW_START, "row2")
 conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")

This will load the RDD with only the records in that row-key range.
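
Put together with the `TableInputFormat` plumbing from the other answers, a minimal sketch (table name and row keys are placeholders, and `sc` is an existing SparkContext):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN_ROW_START, "startrowkey")
conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")

// Only rows in [startrowkey, stoprowkey) are read into the RDD
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
rdd.count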

Narendra Parmar
2

Here is a Java example with TableMapReduceUtil.convertScanToString(Scan scan):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;

public class HbaseScan {

    public static void main(String ... args) throws IOException, InterruptedException {

        // Spark conf
        SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Hbase conf
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");

        // Create scan
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.setStartRow(Bytes.toBytes("a"));
        scan.setStopRow(Bytes.toBytes("d"));


        // Submit scan into hbase conf
        conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

        // Get RDD
        JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
                .newAPIHadoopRDD(conf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class);

        // Process RDD
        System.out.println(source.count());
    }
}
Roman Kondakov