
If I read some data from an HBase (or MapR-DB) table with

JavaPairRDD<ImmutableBytesWritable, Result> usersRDD = sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

the resulting RDD has 1 partition, as I can see by calling usersRDD.partitions().size(). Using something like usersRDD.repartition(10) is not viable either: Spark complains that ImmutableBytesWritable is not serializable.
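
The only workaround I can think of is copying the data into plain serializable types first and then repartitioning - a rough sketch (the column family cf and qualifier name are just placeholders for my schema):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Convert the non-serializable HBase types (ImmutableBytesWritable, Result)
// into plain Strings, after which repartition() works.
JavaPairRDD<String, String> users = usersRDD.mapToPair(t -> new Tuple2<>(
        Bytes.toString(t._1().get()),                                               // row key
        Bytes.toString(t._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))) // one cell
));
JavaPairRDD<String, String> repartitioned = users.repartition(10);

But that still reads everything through a single partition first, so I'd prefer to get a partitioned RDD directly from HBase.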

Is there a way to make Spark create a partitioned RDD from HBase data?

Skice

1 Answer


The number of Spark partitions when using org.apache.hadoop.hbase.mapreduce.TableInputFormat depends on the number of regions of the HBase table - in your case that's 1 (the default for a new table). Have a look at my answer to a similar question for more details.
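
So the way to get more partitions is to have more regions, e.g. by creating the table pre-split. A rough sketch using the HBase 1.x Admin API - the table name, column family and split keys below are only placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Admin admin = connection.getAdmin()) {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("users"));
    desc.addFamily(new HColumnDescriptor("cf"));
    // Four split keys -> five regions -> five Spark partitions from TableInputFormat
    byte[][] splits = new byte[][] {
        Bytes.toBytes("e"), Bytes.toBytes("j"), Bytes.toBytes("o"), Bytes.toBytes("t")
    };
    admin.createTable(desc, splits);
}

For a table that already exists you can also split regions (e.g. with Admin#split), and then verify the effect with usersRDD.partitions().size() after re-reading.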

botchniaque