
If I read some data from an HBase (or MapR-DB) table with

JavaPairRDD<ImmutableBytesWritable, Result> usersRDD = sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

the resulting RDD has 1 partition, as I can see by calling usersRDD.partitions().size(). Using something like usersRDD.repartition(10) is not viable either: Spark complains that ImmutableBytesWritable is not serializable.
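
The only workaround I can think of is copying the data into plain serializable types first and then repartitioning - a rough sketch (the column family cf and qualifier name are just placeholders for my schema):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Convert the non-serializable HBase types (ImmutableBytesWritable, Result)
// into plain Strings, after which repartition() works.
JavaPairRDD<String, String> users = usersRDD.mapToPair(t -> new Tuple2<>(
        Bytes.toString(t._1().get()),                                               // row key
        Bytes.toString(t._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))) // one cell
));
JavaPairRDD<String, String> repartitioned = users.repartition(10);

But that still reads everything through a single partition first, so I'd prefer to get a partitioned RDD directly from HBase.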

Is there a way to make Spark create a partitioned RDD from HBase data?

Skice

1 Answer


The number of Spark partitions when using org.apache.hadoop.hbase.mapreduce.TableInputFormat depends on the number of regions of the HBase table - in your case that's 1 (the default for a new table). Have a look at my answer to a similar question for more details.
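
So the way to get more partitions is to have more regions, e.g. by creating the table pre-split. A rough sketch using the HBase 1.x Admin API - the table name, column family and split keys below are only placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Admin admin = connection.getAdmin()) {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("users"));
    desc.addFamily(new HColumnDescriptor("cf"));
    // Four split keys -> five regions -> five Spark partitions from TableInputFormat
    byte[][] splits = new byte[][] {
        Bytes.toBytes("e"), Bytes.toBytes("j"), Bytes.toBytes("o"), Bytes.toBytes("t")
    };
    admin.createTable(desc, splits);
}

For a table that already exists you can also split regions (e.g. with Admin#split), and then verify the effect with usersRDD.partitions().size() after re-reading.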

botchniaque