
To better illustrate the question, I will start with an example. Let's say you have 3 tables:

  • Students (2 Million rows)
  • TestResults (100 Million rows)
  • Test (100 000 rows).

We group all the relevant data into a denormalized table named 'TestSynthesis' with the key TestID_StudentID_TestResultsID. The order of the fields in the key does not matter, because the user needs to be able to search by any or all of the three parts of this key.
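For concreteness, writing one such denormalized row might look roughly like this with the HBase Java client (1.x-era API; the column family 'cf', the 'score' column and the ID values are made up for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteSynthesisRow {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table synthesis = conn.getTable(TableName.valueOf("TestSynthesis"))) {

                // Composite row key: TestID_StudentID_TestResultsID (example values).
                String rowKey = "0000042" + "_" + "0001000" + "_" + "0055001";

                // 'cf'/'score' are placeholder names for the denormalized payload.
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("score"), Bytes.toBytes("87"));
                synthesis.put(put);
            }
        }
    }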

One example query would be: "give me all test results of all students whose first name starts with 'John'"

A simple solution would be to:

  1. Search the normalized Students table
  2. Extract the StudentIDs (let's say we get 3 IDs: 0001000, 0000999, 0000001)
  3. Then search TestSynthesis on these IDs (like *0001000*, *0000999*, *0000001*)
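A rough sketch of these three steps with the HBase Java client (1.x-era API) is below; it assumes Students is keyed by StudentID and uses an invented 'cf:first_name' column. Note that the SubstringComparator in step 3 is the *0001000* style match, and it is evaluated while scanning row keys rather than by a direct lookup:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SynthesisLookup {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table students = conn.getTable(TableName.valueOf("Students"));
                 Table synthesis = conn.getTable(TableName.valueOf("TestSynthesis"))) {

                // Step 1: scan Students for first names starting with "John".
                Scan studentScan = new Scan();
                studentScan.setFilter(new SingleColumnValueFilter(
                        Bytes.toBytes("cf"), Bytes.toBytes("first_name"),
                        CompareFilter.CompareOp.EQUAL,
                        new BinaryPrefixComparator(Bytes.toBytes("John"))));

                // Step 2: extract the StudentIDs (assuming Students is keyed by StudentID)
                // and build one "*_studentID_*" row-key match per student.
                List<Filter> perStudent = new ArrayList<>();
                try (ResultScanner results = students.getScanner(studentScan)) {
                    for (Result r : results) {
                        String studentId = Bytes.toString(r.getRow());
                        perStudent.add(new RowFilter(
                                CompareFilter.CompareOp.EQUAL,
                                new SubstringComparator("_" + studentId + "_")));
                    }
                }

                // Step 3: scan TestSynthesis, keeping rows whose key contains any of
                // the collected StudentIDs. This is still a scan over the key space,
                // not an indexed lookup.
                Scan synthesisScan = new Scan();
                synthesisScan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ONE, perStudent));
                try (ResultScanner results = synthesis.getScanner(synthesisScan)) {
                    for (Result r : results) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }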

But what if at step 2, we get 1 million rows instead of 3 students?

Can HBase search on the key only (i.e. without reading the data), or does it need to do a full table scan?

Anton

1 Answer


First, the painful facts: as mentioned in other posts, HBase does not fully support partial key lookups. Let's say your key is structured the way you mentioned: TestID_StudentID_TestResultsID

You can't search by the StudentID and enjoy good performance at the same time. However, there is a way to filter the rows during a scan based only on the row key. But remember, filters perform worse than a well-optimized table schema. So, from my understanding and experience, you have the following options:

1) You design the schema so that it is optimized for the majority of your queries. For example, if most queries need the data for a given student, you put the StudentID at the beginning of the key. Then, if some queries need to know the results of a given student for a particular test, you do a scan with {STARTROW => given_studentID, STOPROW => given_studentID+1} and filter the results using key filters; the key filters can be set up to match a particular TestID. But in this case, if you want to see which students participated in a test, you need to do a full scan and apply filters to the result set, and that is not very "cheap" in terms of performance.
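Roughly, such a scan could look like this with the Java client (1.x-era API); the key layout StudentID_TestID_TestResultsID, the zero-padded IDs and the column family are assumptions for illustration:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.filter.SubstringComparator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StudentTestScan {
        public static void main(String[] args) throws Exception {
            String studentId = "0001000";   // hypothetical, zero-padded as in the question
            String testId = "0000042";      // hypothetical

            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table synthesis = conn.getTable(TableName.valueOf("TestSynthesis"))) {

                // Key layout for this option: StudentID_TestID_TestResultsID.
                // STARTROW/STOPROW restrict the scan to one student's key range ...
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes(studentId));
                scan.setStopRow(Bytes.toBytes(String.format("%07d", Long.parseLong(studentId) + 1)));

                // ... and a key filter keeps only the rows of one particular test.
                scan.setFilter(new RowFilter(
                        CompareFilter.CompareOp.EQUAL,
                        new SubstringComparator("_" + testId + "_")));

                try (ResultScanner results = synthesis.getScanner(scan)) {
                    for (Result r : results) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }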

2) As suggested in http://stackoverflow.com/questions/12806762/hbase-data-only-in-key-compoung-key-wildcards, you could also duplicate the data into two different tables, the first keyed as TestID_StudentID_TestResultsID and the second as StudentID_TestID_TestResultsID. This way you can query by a given TestID or StudentID, because you can happily search either the first or the second table. Again, you are in trouble if you want to know which student holds the best results across all tests; in that case you once again need a full scan. It is important to know what kind of queries you are going to run, so that you can plan your schema appropriately. (Do you really need the test results in the key anyway?)
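A sketch of the write side of this dual-table option might look as follows; the table names SynthesisByTest and SynthesisByStudent, the column family 'cf' and the payload are invented for illustration:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DualWrite {
        // Write one test result under both key orderings, so that scans by TestID
        // and scans by StudentID both become cheap prefix scans.
        static void writeResult(Connection conn, String testId, String studentId,
                                String resultId, byte[] payload) throws Exception {
            try (Table byTest = conn.getTable(TableName.valueOf("SynthesisByTest"));
                 Table byStudent = conn.getTable(TableName.valueOf("SynthesisByStudent"))) {

                Put p1 = new Put(Bytes.toBytes(testId + "_" + studentId + "_" + resultId));
                p1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("result"), payload);
                byTest.put(p1);

                Put p2 = new Put(Bytes.toBytes(studentId + "_" + testId + "_" + resultId));
                p2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("result"), payload);
                byStudent.put(p2);
            }
        }

        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
                writeResult(conn, "0000042", "0001000", "0055001", Bytes.toBytes("87"));
            }
        }
    }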

3) There is also a third option, which I used once. It depends pretty much on the data itself. Let's say you are going to have 500 000 students, around 50 tests, and therefore 50 * 500 000 test results. In this case I would design the key the following way: TestID_StudentID_Results

  • If you want to see how students did on a particular test, you can scan by the test, since the key starts with it.

  • If you want to see how a student did on all of their tests, you could do 50 scans, each using one of the 50 TestIDs at the beginning of the key, so that the scans would look like:

test1_studentX
test2_studentX
test3_studentX
test4_studentX
...
test50_studentX

50 scans might sound like a lot, but this suits a MapReduce job very well (a sketch of these per-test scans follows after the last bullet below). The question here is: do you really have only 50 TestIDs, or is the number you actually have suitable for this?

  • If you want to see how all students did on all tests, you still have to do a full scan with this schema.
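A sketch of the 50 per-test scans for one student could look like this (Java client, 1.x-era API; the "test1"…"test50" and "studentX" identifiers follow the example keys above, everything else is assumed):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerTestScans {
        public static void main(String[] args) throws Exception {
            String studentId = "studentX";   // hypothetical, as in the example keys above
            int numberOfTests = 50;

            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table synthesis = conn.getTable(TableName.valueOf("TestSynthesis"))) {

                // One prefix scan per test; key layout is TestID_StudentID_Results,
                // so "testN_studentX_" pins both leading components (the trailing
                // separator avoids also matching e.g. "studentXY").
                for (int i = 1; i <= numberOfTests; i++) {
                    String prefix = "test" + i + "_" + studentId + "_";
                    Scan scan = new Scan();
                    scan.setStartRow(Bytes.toBytes(prefix));
                    scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)));

                    try (ResultScanner results = synthesis.getScanner(scan)) {
                        for (Result r : results) {
                            System.out.println(Bytes.toString(r.getRow()));
                        }
                    }
                }
            }
        }
    }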

That's it from me!

Cheers, Niko
