
I am trying to pre-split an HBase table. One of the HBaseAdmin Java APIs creates a table from a start key, an end key, and a number of regions. Here is the method I use from HBaseAdmin: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)

Is there any recommendation on choosing the start key and end key based on the dataset?

My approach: let's say we have 100 records in the dataset, and I want the data divided into approximately 10 regions so that each holds roughly 10 records. To find the start key I would run scan '/mytable', {LIMIT => 10} and pick the last rowkey returned as my start key, then run scan '/mytable', {LIMIT => 90} and pick the last rowkey as my end key.

Does this approach to finding the start key and end key look OK, or is there a better practice?
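(For reference, the same idea with the Java client would look roughly like the sketch below. It is only an illustration of the approach above, using the 100-record / 10-region numbers; the table name is a placeholder written without the leading slash, so adjust it to however your client addresses the table.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("mytable"))) {

    Scan scan = new Scan();
    scan.setFilter(new KeyOnlyFilter());   // only rowkeys are needed, skip the values
    scan.setCaching(1000);

    int rowNumber = 0;
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            rowNumber++;
            if (rowNumber % 10 == 0) {     // every 10th rowkey is a candidate split point
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
    }
}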

EDIT: I tried the following approaches to pre-split an empty table. None of the three worked the way I used them. I think I will need to salt the key to get an even distribution.

PS: I am displaying only some of the region info.

1)

byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);

This gives regions with boundaries like:

{
    "startkey":"-INFINITY",
    "endkey":"11111111",
    "numberofrows":3628951,
},
{
    "startkey":"11111111",
    "endkey":"22222222",
},
{   
    "startkey":"22222222",
    "endkey":"33333333",
},
{
    "startkey":"33333333",
    "endkey":"44444444",
},
{
    "startkey":"88888888",
    "endkey":"99999999",
},
{
    "startkey":"99999999",
    "endkey":"aaaaaaaa",
},
{
    "startkey":"aaaaaaaa",
    "endkey":"bbbbbbbb",
},
{
    "startkey":"eeeeeeee",
    "endkey":"INFINITY",
}

This is useless, as my rowkeys are of a composite form like 'deptId|month|roleId|regionId' and don't fit into the above boundaries.

2)

byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits)

This has the same issue:

{
    "startkey":"-INFINITY",
    "endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
}
{
    "startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\
    "endkey":"33333332",
}
{
    "startkey":"33333332",
    "endkey":"L\\xCC\\xCC\\xCC\\xCC\\xCC\\xCC\\xCB",
}
{
    "startkey":"\\xE6ffffffa",
    "endkey":"INFINITY",
}

3) I tried supplying a start key and end key and got the following useless regions:

hBaseAdmin.createTable(tabledescriptor, Bytes.toBytes("04120|200808|805|1999"),
                               Bytes.toBytes("01253|201501|805|1999"), 10);
{
    "startkey":"-INFINITY",
    "endkey":"04120|200808|805|1999",
}
{
    "startkey":"04120|200808|805|1999",
    "endkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
}
{
    "startkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
    "endkey":"000ptq<200wp6\\xBC805|1999",
}
{
    "startkey":"001\\x11\\x15\\x13\\x1C201\\x15\\x902\\x5C805|1999",
    "endkey":"01253|201501|805|1999",
}
{
    "startkey":"01253|201501|805|1999",
    "endkey":"INFINITY",
}

1 Answer


First question: In my experience with HBase, I am not aware of any hard rule for choosing the number of regions or the start and end keys.

But the underlying principle is:

With your rowkey design, data should be distributed across the regions and not hotspotted (see 36.1. Hotspotting in the HBase reference guide).

However, if you define a fixed number of regions, say the 10 you mentioned, there may not be 10 after a heavy data load: once a region reaches a certain size limit, it will split again.

For the way you are creating the table, the HBaseAdmin documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
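In other words, the intermediate boundaries are not taken from your data at all. As far as I remember, HBaseAdmin derives them by byte-wise arithmetic interpolation between the two keys (Bytes.split), which is why you see binary-looking boundaries like 000PTP\xDC200W... in your third attempt; note also that your start key '04120|...' actually sorts after your end key '01253|...'. A tiny illustration with made-up, properly ordered keys:

import org.apache.hadoop.hbase.util.Bytes;

// Illustration only: Bytes.split treats the two keys as large unsigned integers
// and interpolates between them, so the generated split points are arithmetic
// values, not rowkeys that occur in your dataset.
byte[][] boundaries = Bytes.split(Bytes.toBytes("00000|200801|805|1999"),
                                  Bytes.toBytes("99999|201512|805|1999"),
                                  7); // 7 interpolated points + the 2 endpoints = 9 split keys -> 10 regions
for (byte[] key : boundaries) {
    System.out.println(Bytes.toStringBinary(key));
}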

Moreover, I prefer creating the table through a script with pre-splits (say 10 of them) and designing the rowkey so that it is salted; each salted row then lands on a predictable region and hotspotting is avoided, roughly as sketched below.
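A minimal sketch of that idea with the Java API, reusing the hBaseAdmin variable from your snippets ("mytable" and "cf" are placeholders, and the 10 salt buckets with a hash-modulo salt are just one possible scheme; remember that reads then have to fan out over all salt prefixes):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.util.Bytes;

// Pre-split on salt prefixes 0..9: the split points "1".."9" produce 10 regions.
byte[][] saltSplits = new byte[9][];
for (int i = 1; i <= 9; i++) {
    saltSplits[i - 1] = Bytes.toBytes(String.valueOf(i));
}

HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("mytable"));
tableDescriptor.addFamily(new HColumnDescriptor("cf"));
hBaseAdmin.createTable(tableDescriptor, saltSplits);

// When writing, prepend a one-character salt derived from the natural key so the
// rows spread evenly across the 10 regions.
String naturalKey = "04120|200808|805|1999";          // deptId|month|roleId|regionId
int salt = (naturalKey.hashCode() & 0x7fffffff) % 10; // stable bucket in 0..9
byte[] rowKey = Bytes.toBytes(salt + "|" + naturalKey);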

EDIT: If you want to implement your own region splits, you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override

public byte[][] split(int numberOfSplits)
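Alternatively, if you do not want to wire up a full SplitAlgorithm, you can get the same effect by passing createTable explicit split points that match your composite 'deptId|month|roleId|regionId' keys; the array built below is exactly what such a split(int numberOfSplits) implementation would return. The deptId boundaries here are made up, and hBaseAdmin/tabledescriptor are the variables from your own snippets:

import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical deptId boundaries; replace them with prefixes that actually occur in your data.
String[] deptBoundaries = {"01000", "02000", "03000", "04000",
                           "05000", "06000", "07000", "08000", "09000"};

byte[][] splits = new byte[deptBoundaries.length][];
for (int i = 0; i < deptBoundaries.length; i++) {
    splits[i] = Bytes.toBytes(deptBoundaries[i]);
}

// 9 split points -> 10 regions; a key like '04120|200808|805|1999' then lands
// in the ["04000", "05000") region.
hBaseAdmin.createTable(tabledescriptor, splits);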

Second question: My understanding is that you want to find the start rowkey and end rowkey of the data inserted into a specific table. Below are a few ways.

  • If you want to find start and end rowkeys, scan the 'hbase:meta' catalog table; it shows the start and end rowkey of every region.

  • You can also open the master UI at http://hbasemaster:60010 to see how the rowkeys are spread across the regions; the start and end rowkeys are listed for each region.

  • To see how your keys are organized after pre-splitting the table and inserting into HBase, use FirstKeyOnlyFilter,

for example: scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()', which displays all 100 of your rowkeys.

If you have a lot of data (not just the 100 rows you mentioned) and want to dump all rowkeys, you can run the following from outside the shell:

echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt
  • Thanks, I tried to dump all rowkeys as well using `count 'tablename', INTERVAL => 1`, but that seemed equally time consuming. – nir Jun 02 '16 at 05:03
  • Yes, it will take some time if you have a lot of data. – Ram Ghadiyaram Jun 02 '16 at 12:37
  • Thanks for polishing the answer. I am not focusing on salting or data skew at this point; I just want to see regions with some meaningful boundaries (start key and end key). I did find out what start key and end key I should use, but none of the APIs is giving me the expected result; I basically get meaningless boundaries. I think it has to do with the rowkey format in my application and how the HBaseAdmin API and the RegionSplitter algorithms work. I will edit my question further about what I am seeing after using all these splitting approaches. – nir Jun 02 '16 at 23:08
  • Yes, please do. I am not sure how far my answer was helping you. Along with pre-splitting of empty regions, rowkey design is the key thing that needs special attention. I would suggest a meaningful rowkey built from the business parameters you query by, e.g. saltedPrefix + source + envt + cobdate + murmurhash(content of the row); with this you can easily query records by source, envt or cobdate, and the salted prefix avoids hotspotting and ensures that data is evenly distributed across region servers. If you would like to know more I can enter into chat. Thx – Ram Ghadiyaram Jun 03 '16 at 03:34
  • Agreed, I'll have to add a salt prefix to each key, otherwise I am not going to get any advantage from the pre-split. On a side note, this whole exercise is to test whether I get concurrent write performance. So does having multiple regions beforehand help with concurrent write performance? I am writing with 100 threads from multiple client machines. – nir Jun 03 '16 at 04:46
  • Yes, it will help concurrent write performance. MapReduce + HBase will do concurrent writes. In fact we have used multithreaded mappers (https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/mapreduce/MultithreadedTableMapper.html) for a map-only job that writes into HBase, where each mapper can set the number of threads you want. – Ram Ghadiyaram Jun 03 '16 at 05:15
  • Let's say the salt prefix is not what I want to do right now. In that case my options are, I think, to provide all the individual split points or implement my own `RegionSplitter.SplitAlgorithm`, right? – nir Jun 03 '16 at 05:23
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/113710/discussion-between-ramprasad-g-and-nir). – Ram Ghadiyaram Jun 03 '16 at 08:17
  • Yes, you can implement and provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm. – Ram Ghadiyaram Jun 03 '16 at 08:23