
I am using Spark 1.4 and trying to read 2.7 GB of data from HBase using sc.newAPIHadoopRDD, but only 5 tasks are created for this stage and it takes 2 to 3 minutes to process. Can anyone tell me how to increase the number of partitions so the data is read faster?
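For reference, the read presumably looks something like the sketch below (run from spark-shell, so sc is already defined; the table name "my_table" is a placeholder, not from the question):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

    // TableInputFormat creates one input split (and hence one task) per HBase region
    val hbaseRDD = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(hbaseRDD.partitions.length)  // 5 in this case, matching the region count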

sukumar n
  • The number of tasks (or partitions) depends on the InputFormat used, so it does not seem possible to increase it with the standard input format. You may want to try the newer spark-on-hbase or HBase connector packages – Ayan Guha Sep 22 '16 at 02:24

1 Answer


org.apache.hadoop.hbase.mapreduce.TableInputFormat creates one partition per region, and your table seems to be split into 5 regions, hence the 5 tasks. Pre-splitting your table should increase the number of partitions (have a look here for more information on region splitting).
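As a minimal sketch, pre-splitting can be done through the HBase admin API; the table name, column family, key range, and region count below are all illustrative assumptions:

    import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
    import org.apache.hadoop.hbase.client.HBaseAdmin
    import org.apache.hadoop.hbase.util.Bytes

    val admin = new HBaseAdmin(HBaseConfiguration.create())

    // Create a new table pre-split into 20 regions across an assumed row-key range;
    // each region becomes one Spark partition when read via TableInputFormat
    val desc = new HTableDescriptor(TableName.valueOf("my_table"))
    desc.addFamily(new HColumnDescriptor("cf"))
    admin.createTable(desc, Bytes.toBytes("row0000000"), Bytes.toBytes("row9999999"), 20)

    // An existing table's regions can instead be split manually:
    admin.split("my_table")

    admin.close()

Note that the split points only help if they match the actual row-key distribution; otherwise some regions (and their tasks) will remain much larger than others and the stage will still be dominated by the biggest ones.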

botchniaque