I am reading data from HBase through Spark. The code runs fine when I read the data using a prefix filter with a complete rowkey or using a `GET`, but it freezes if I use a partial rowkey prefix. The rowkey structure is `md5OfAkey_Akey_txDate_someKey`. I want to read all data matching the `Akey` values from a data frame. The table has a single column family with 50 column qualifiers and holds around 200 million records. When I read using `md5OfAkey_Akey_txDate` the code gets stuck, while if I construct the whole key it runs fine. But I do not want to pass the whole rowkey, as I want to read all data for a particular account (`Akey`) and transaction date (`txDate`). Any help would be appreciated.
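
To make the partial key concrete, here is a minimal sketch of how the `md5OfAkey_Akey_txDate` prefix could be built. The function name `rowkey_prefix`, the hex-encoded MD5, the UTF-8 encoding, and the `_` separator are assumptions for illustration; the question does not say how the real keys are encoded.

```python
import hashlib


def rowkey_prefix(akey: str, tx_date: str, sep: str = "_") -> bytes:
    """Build the partial rowkey md5OfAkey_Akey_txDate_ as bytes.

    Assumes the MD5 part is the hex digest of the UTF-8 encoded Akey
    (hypothetical; adjust to match how the table was actually loaded).
    """
    md5_hex = hashlib.md5(akey.encode("utf-8")).hexdigest()
    # The trailing separator keeps a txDate such as "2020010" from also
    # matching keys whose txDate is "20200101".
    return sep.join([md5_hex, akey, tx_date, ""]).encode("utf-8")


# Hypothetical account id and date, only to show the shape of the prefix.
prefix = rowkey_prefix("acct42", "20200103")
full_key = prefix + b"someKey"
print(full_key.startswith(prefix))
```

Every full rowkey for that account and date then starts with this prefix, which is what a prefix or range scan would match on.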

Nikhil Suthar

Shaggy1755
- Performing a scan by partial rowkey (i.e. using PrefixFilter) is expected to be slower than a direct `get`. Can you quantify "stuck", or does it never return? – mazaneicha Jan 03 '20 at 19:49
- Sorry for the late reply. I went ahead with the MultiRowRangeFilter in HBase, and the code runs much faster than with the prefix or the fuzzy filter. I am still not sure why the prefix filter was taking more than 10 minutes to get data for a single partial rowkey, whereas the MultiRowRangeFilter brings back the same data in seconds. – Shaggy1755 May 08 '20 at 16:40
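
One possible explanation, offered as an assumption rather than something confirmed in this thread: a PrefixFilter on its own does not give the scan a start row, so the scan may read from the beginning of the table until it reaches the prefix, while a row range lets HBase seek directly to the first matching key. The sketch below shows how a prefix can be turned into an explicit start row (inclusive) and stop row (exclusive); the helper name `prefix_to_row_range` is hypothetical. In the Java client, such a pair could be passed to `Scan.withStartRow` / `Scan.withStopRow` or wrapped in a `MultiRowRangeFilter.RowRange`.

```python
from typing import Optional, Tuple


def prefix_to_row_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every key that starts with prefix.

    start_row is inclusive and stop_row is exclusive. A stop_row of None
    means there is no upper bound (scan to the end of the table).
    """
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, None
    # Incrementing the last remaining byte gives the smallest key that
    # sorts after every key sharing the prefix.
    stop[-1] += 1
    return prefix, bytes(stop)


start, stop = prefix_to_row_range(b"abc_")
# Python compares bytes lexicographically by unsigned byte value,
# which is the same ordering HBase uses for rowkeys.
print(start <= b"abc_someKey" < stop)
```

With one such range per `Akey` and `txDate` pair, a single scan can cover many accounts while skipping everything between the ranges.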