
We are using newAPIHadoopRDD to scan a Bigtable table and load its records into an RDD. The RDD populates fine via newAPIHadoopRDD for a smaller table (say, fewer than 100K records), but it fails to load records from a larger one (say, 6M records).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("mc-bigtable-sample-scan");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableInputFormat.INPUT_TABLE, "listings"); // target table
Scan scan = new Scan();
scan.addColumn(COLUMN_FAMILY_BASE, COLUMN_COL1); // byte[] constants defined elsewhere
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
JavaPairRDD<ImmutableBytesWritable, Result> source = jsc.newAPIHadoopRDD(hbaseConf,
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
System.out.println("source count " + source.count());

The count prints correctly for the smaller table, but it shows zero for the larger one.

I tried many different configuration options, such as increasing the driver memory and the number of executors and workers, but nothing worked. An example of the sort of tuning I attempted is shown below.
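For reference, a minimal sketch of those attempts; the values are illustrative, not a working fix:

SparkConf tunedConf = new SparkConf()
        .setAppName("mc-bigtable-sample-scan")
        .set("spark.executor.instances", "10") // number of executors (illustrative value)
        .set("spark.executor.memory", "4g");   // per-executor memory (illustrative value)
// Note: spark.driver.memory only takes effect if supplied at launch
// (e.g. spark-submit --driver-memory 8g), since in client mode the driver
// JVM is already running by the time this SparkConf is read.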

Could someone help please?

Please include a [minimal, verifiable and complete example](https://stackoverflow.com/help/mcve) to make it easier to help you. Nobody knows how your table looks or what data it contains, and therefore it's difficult to find an appropriate answer. – Thomas Flinkow Mar 02 '18 at 09:37

1 Answer


My bad. I found the issue in my code: the column COLUMN_COL1 that I was trying to scan does not exist in the bigger Bigtable table. Because the scan was restricted to that single column, every row was filtered out and the count came back as 0 instead of failing.
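A quick way to confirm this kind of problem, as a minimal sketch reusing the hbaseConf and jsc from the question: widen the scan from the single qualifier to the whole column family and compare counts.

Scan familyScan = new Scan();
// Scan every column in the family instead of the single (missing) qualifier
familyScan.addFamily(COLUMN_FAMILY_BASE);
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(familyScan));
JavaPairRDD<ImmutableBytesWritable, Result> familySource = jsc.newAPIHadoopRDD(hbaseConf,
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
// A non-zero count here, with zero for the single-column scan, means the
// qualifier COLUMN_COL1 simply has no cells in this table.
System.out.println("family count " + familySource.count());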