
I have a use-case to read from HBase inside a PySpark job, and I am currently doing a scan on the HBase table like this:

conf = {"hbase.zookeeper.quorum": host, "hbase.cluster.distributed": "true", "hbase.mapreduce.inputtable": "table_name", "hbase.mapreduce.scan.row.start": start, "hbase.mapreduce.scan.row.stop": stop}

rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result", keyConverter=keyConv, valueConverter=valueConv,conf=cmdata_conf)
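For reference, keyConv and valueConv are string converter class names. The ones shipped with the Spark examples look like this (an assumption for completeness only; the actual converters in my setup may differ):

keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"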

I am unable to find the conf keys to do a GET on the HBase table. Can someone help me? I have found that filters are not supported with PySpark, but is it really not possible to do a simple GET?
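The closest I can think of is to narrow the scan to a single row key, relying on the scan stop row being exclusive, but I am not sure this is the intended way. A sketch only (row_key is a placeholder; host, keyConv and valueConv are as above):

# Emulate a GET by scanning exactly one row key.
# The stop row of an HBase scan is exclusive, so appending a null byte
# makes the range cover only row_key itself.
row_key = "some_row_key"          # placeholder
get_conf = {"hbase.zookeeper.quorum": host,
            "hbase.cluster.distributed": "true",
            "hbase.mapreduce.inputtable": "table_name",
            "hbase.mapreduce.scan.row.start": row_key,
            "hbase.mapreduce.scan.row.stop": row_key + "\x00"}

single_row_rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                                    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                                    "org.apache.hadoop.hbase.client.Result",
                                    keyConverter=keyConv,
                                    valueConverter=valueConv,
                                    conf=get_conf)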

Thanks!

  • If it is in the middle of a job, you can't directly execute a GET by row key through that API. Maybe use happybase from Python, so that there is no need to launch a separate job for it? Isn't that possible? – Ram Ghadiyaram Dec 22 '16 at 15:06
  • happybase doesn't support the latest versions of HBase. – void Dec 23 '16 at 06:50
  • happybase was just an example; if not happybase, you can try any other means (I am not good at Python). A simple example from my experience: in a MapReduce + HBase driver we used a scan, and inside the map method we had to look up certain rows by row key with a `get`. We achieved this by creating a separate connection to that table and making sure it was closed once the work was done. You can do the same thing here. Feel free to ask any questions; from my experience it is possible. – Ram Ghadiyaram Dec 23 '16 at 11:05
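Following up on the happybase suggestion in the comments, a minimal sketch of that approach would be the following (host, 'table_name' and the row key are placeholders, and it requires the HBase Thrift server to be running):

import happybase

# Open a connection, do a true GET by row key, then close the connection,
# as suggested in the comments above.
connection = happybase.Connection(host)
table = connection.table('table_name')
row = table.row(b'some_row_key')   # dict of column -> value for that row
connection.close()

The same pattern could be used inside a mapPartitions call so that each partition opens one connection, performs its lookups, and closes it when done.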

0 Answers