
Is there any way to limit a poor query's impact on an HBase cluster?

If so, what needs to be done to achieve it?

Do I need Kerberos to identify users so that I can limit their queries' impact or assign resources to them?

Poor queries from Phoenix can kill the whole HBase cluster, and this is something I really want to change. I would be extremely grateful for any hint on this topic.

1 Answer


We had a similar issue at Splice Machine when running OLAP queries in our pre-2.0 versions. In 2.0 we introduced a new execution engine implemented on Spark that uses hybrid scanners, which read data directly from HFiles and merge that data with what is coming from the HBase Memstore. This reduces the impact of such large scans on the region servers to a minimum, since we only access HBase's in-memory data.

You can check how we implemented it in our repository. The main classes would be the SplitRegionScanner and the MemstoreAwareObserver.
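
Those classes are fairly involved, so here is a rough, self-contained sketch of the underlying idea of reading HFiles directly instead of scanning through the region servers, using stock HBase's TableSnapshotScanner. This is not Splice Machine's implementation (Splice reads live HFiles and merges in Memstore data via a coprocessor), and the table, snapshot, and column-family names below are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.TableSnapshotScanner;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DirectHFileScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Snapshot taken beforehand, e.g. in the HBase shell:
            //   snapshot 'my_table', 'my_table_snap'
            String snapshotName = "my_table_snap";   // placeholder name

            // Scratch directory on the HBase filesystem where the
            // snapshot's file references get restored for reading.
            Path restoreDir = new Path("/tmp/snapshot-restore");

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf"));     // placeholder column family

            // Reads the snapshot's HFiles directly from the filesystem:
            // no region-server RPCs, so even a huge scan cannot overload
            // the online cluster.
            try (TableSnapshotScanner scanner =
                     new TableSnapshotScanner(conf, restoreDir, snapshotName, scan)) {
                for (Result r = scanner.next(); r != null; r = scanner.next()) {
                    // process each row here
                }
            }
        }
    }

The trade-off is that a snapshot is a point-in-time view of the table; the approach described above instead merges live Memstore data, so results stay current while the HFile reads still bypass the region-server scan path.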

  • Out of curiosity: how do you make sure that the HFiles are consistent for the duration of the Spark query -- through HBase snapshots? – Samson Scharfrichter Sep 26 '16 at 10:52
  • @SamsonScharfrichter we use a coprocessor (the MemstoreAwareObserver I linked earlier) to make sure our scans are consistent. We delay our scan if a compaction or split is running (for some milliseconds) and block compactions/splits while a scan is running in that region. – Daniel Gómez Ferro Sep 26 '16 at 14:51
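
The coordination described in the last comment can be sketched with a plain HBase RegionObserver coprocessor. The class below is purely illustrative (a hypothetical ScanAwareObserver, not Splice Machine's MemstoreAwareObserver, which coordinates more carefully and also handles splits and delays scans during in-flight compactions): it counts open scanners in the region and fails compaction attempts while any scan is in flight, so the set of HFiles stays stable for readers.

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.RegionScanner;
    import org.apache.hadoop.hbase.regionserver.ScanType;
    import org.apache.hadoop.hbase.regionserver.Store;

    // Hypothetical example only: track open client scanners per region and
    // refuse compactions while any scan is in flight.
    public class ScanAwareObserver extends BaseRegionObserver {

        private final AtomicInteger activeScans = new AtomicInteger(0);

        @Override
        public RegionScanner preScannerOpen(ObserverContext<RegionCoprocessorEnvironment> c,
                                            Scan scan, RegionScanner s) throws IOException {
            activeScans.incrementAndGet();   // a scan is starting in this region
            return s;
        }

        @Override
        public void postScannerClose(ObserverContext<RegionCoprocessorEnvironment> c,
                                     InternalScanner s) throws IOException {
            activeScans.decrementAndGet();   // scan finished; compactions may proceed again
        }

        @Override
        public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
                                          Store store, InternalScanner scanner,
                                          ScanType scanType) throws IOException {
            if (activeScans.get() > 0) {
                // Fail this compaction attempt; HBase's periodic compaction
                // checker will request it again once the scans have finished.
                throw new IOException("Postponing compaction: "
                        + activeScans.get() + " active scan(s) in this region");
            }
            return scanner;
        }
    }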