I am using Spark to load HBase data into a JavaPairRDD<>. I am wondering whether I can load only the latest 100 rows into Spark instead of all rows from HBase. 1) I tried scan.setCaching(100), but it still returned all rows. Is that method meant to limit the number of rows I load from HBase? 2) How can I make sure I get the latest 100 rows?
Any ideas? Thanks a lot.
Scan scan = new Scan();
scan.setFilter(/* a list of filters */);
scan.setCaching(100); // number of rows fetched per scanner RPC call

// Serialize the Scan so TableInputFormat can read it from the configuration
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
String scanStr = Base64.encodeBytes(proto.toByteArray());

hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);
hbaseConfig.set(TableInputFormat.SCAN, scanStr);

JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    javaSparkContext.newAPIHadoopRDD(hbaseConfig, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class).cache();
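In case it helps to show what I have been considering: from the HBase docs, setCaching only controls how many rows come back per RPC, not a total limit, so I was thinking of a PageFilter combined with a reversed scan. This is only a sketch of that idea (it assumes my row keys sort by time, which may not hold in general), and I understand PageFilter limits rows per region rather than overall, so a final take(100) on the Spark side would presumably still be needed:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class LatestRowsScan {
    public static Scan buildScan() {
        Scan scan = new Scan();
        // Walk row keys in descending order; only useful if row keys
        // encode time in ascending order (an assumption about my schema).
        scan.setReversed(true);
        // Stop each region server after ~100 rows; this is a per-region
        // limit, so the client can still see more than 100 rows total.
        scan.setFilter(new PageFilter(100));
        // Batch size per RPC round-trip; this does NOT cap the row count.
        scan.setCaching(100);
        return scan;
    }
}
```

After loading the RDD from this scan, something like hBaseRDD.take(100) (or a sort by timestamp first) would trim the per-region overshoot down to exactly 100 rows, if I understand the semantics correctly.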