We are using newAPIHadoopRDD to scan a Bigtable table and load its records into an RDD. The RDD gets populated fine for a smaller table (say, fewer than 100K records), but it fails to load records from a larger one (say, 6M records).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("mc-bigtable-sample-scan");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableInputFormat.INPUT_TABLE, "listings");
Scan scan = new Scan(); // COLUMN_FAMILY_BASE / COLUMN_COL1 are byte[] constants defined elsewhere
scan.addColumn(COLUMN_FAMILY_BASE, COLUMN_COL1);
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan)); // throws IOException
JavaPairRDD<ImmutableBytesWritable, Result> source = jsc.newAPIHadoopRDD(hbaseConf,
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
System.out.println("source count " + source.count());
The count prints correctly for the smaller table, but it shows zero for the larger one.
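To narrow it down, I can also check whether the scan produces any input partitions at all. A minimal diagnostic sketch (assuming Spark 1.6+ for getNumPartitions(); on older versions source.partitions().size() works instead):

// How many input splits did TableInputFormat produce for this scan?
// Zero, or an unexpectedly low number, would point at the input format /
// region resolution rather than at Spark itself.
System.out.println("input partitions: " + source.getNumPartitions());
// isEmpty() stops after the first non-empty partition, so it is a cheaper
// check than count() on the 6M-record table.
System.out.println("rdd empty? " + source.isEmpty());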
I have tried many different configuration options, such as increasing the driver memory and the number of executors and workers, but nothing works.
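For reference, the tuning attempts look roughly like this (the values below are illustrative, not the exact ones from our runs):

// Illustrative resource tuning; values here are examples only.
// Note: spark.driver.memory only takes effect if set before the driver JVM
// starts (e.g. via spark-submit --driver-memory), not on a running driver.
SparkConf tunedConf = new SparkConf()
        .setAppName("mc-bigtable-sample-scan")
        .set("spark.driver.memory", "8g")
        .set("spark.executor.instances", "8")
        .set("spark.executor.memory", "4g");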
Could someone help please?