
I am using Spark to load HBase data into a JavaPairRDD<>. I would like to load only the latest 100 rows into Spark instead of every row in the HBase table. 1) I tried scan.setCaching(100), but it still returned all rows. Is setCaching meant to limit the number of rows I load from HBase? 2) How can I make sure I get the latest 100 rows?

Any ideas? Thanks a lot.

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
    import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
    import org.apache.hadoop.hbase.util.Base64;
    import org.apache.spark.api.java.JavaPairRDD;

    Scan scan = new Scan();
    scan.setFilter(filterList); // filterList: the "list of filters" mentioned above
    scan.setCaching(100);

    // Serialize the Scan so it can be passed through the Hadoop configuration
    ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
    String scanStr = Base64.encodeBytes(proto.toByteArray());

    hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);
    hbaseConfig.set(TableInputFormat.SCAN, scanStr);

    JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
            javaSparkContext.newAPIHadoopRDD(hbaseConfig, TableInputFormat.class,
                    ImmutableBytesWritable.class, Result.class).cache();
Laodao

1 Answer


Scan.setCaching specifies how many results are fetched per RPC call. When you set it to 100, the client retrieves results in batches of 100 (or fewer, if fewer rows remain). setCaching is a network performance optimization; it does not change the total number of rows the scan returns.
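If you want to actually cap the number of rows a scan returns, a PageFilter is the usual tool. A minimal sketch; note that HBase evaluates PageFilter on each region server independently, so a table spanning several regions can still hand Spark up to 100 rows per region, and you would trim the excess on the driver if you need an exact count:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PageFilter;

    Scan scan = new Scan();
    scan.setCaching(100);                 // RPC batch size: network tuning only
    scan.setFilter(new PageFilter(100));  // upper bound on rows per region server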

For an ordered result, e.g. the last 100 rows, you first need to define what "last" means: a user's last 100 activities, or the last 100 rows inserted into the whole table? If you mean the table, HBase does not return data in the order you wrote it; it returns rows sorted by row key byte value, so you need a time-based rowkey to get a time-ordered result. But putting the time in the first part of the rowkey creates hotspot regions, so you should not do that :)
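One common compromise, sketched below with hypothetical names (userId, eventTimeMillis): lead the key with a user id (or a salt) so writes spread across regions, and follow it with a reversed timestamp so the newest rows sort first. Scanning the first 100 rows of a user's key range then yields that user's latest 100 events:

    import org.apache.hadoop.hbase.util.Bytes;

    // Rowkey sketch: userId spreads writes across regions; the reversed
    // timestamp makes the newest rows sort first within each user's range.
    static byte[] rowKey(String userId, long eventTimeMillis) {
        long reversedTs = Long.MAX_VALUE - eventTimeMillis; // newest sorts first
        return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(reversedTs));
    }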

halil