4

I have a hbase table called "users", rowkey consists of three parts:

  1. userid
  2. messageid
  3. timestamp

rowkey looks like: ${userid}_${messageid}_${timestamp}

Given I can hash the userid and make the length of the field fixed, is there anyway I can do a query like SQL query:

select distinct(userid) from users

If rowkey doesn't allow me to query like this, does that mean I need to create a separated table just contains all the user ids? I guess if I do something like that, it won't be atomic anymore when I insert a record in, becoz I am dealing with two tables without transaction.

Shengjie
  • 12,336
  • 29
  • 98
  • 139

2 Answers2

2

You can do that but as a map/reduce job not a direct query

Arnon Rotem-Gal-Oz
  • 25,469
  • 3
  • 45
  • 68
0

you can use HashSet to do that. Something like this :

public Set<String> getDistinctCol(String tableName,String colFamilyName, String colName)
   {
    Set<String> set = new HashSet<String>();
    ResultScanner rs=null;
    Result r = null;
    String s = null;
    try 
    {
        HTable table = new HTable(conf, tableName);
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes(colFamilyName),Bytes.toBytes(colName));
        rs = table.getScanner(scan);
        while((res=rs.next()) != null)
        {
            byte [] col = res.getValue(Bytes.toBytes(colFamilyName+":"+colName));                
            s = Bytes.toString(col);
            set.add(s);
        }
    } catch (IOException e) 
    {
        System.out.println("Exception occured in retrieving data");
    }
    finally
    {
        rs.close();
    }
    return set;

*col in your case is userID.

HTH

Tariq
  • 34,076
  • 8
  • 57
  • 79
  • So u r recommending i also keep userid as a col, then do a full table scan whenever I want to get a set of userids. It's feeling a bit expensive tho. – Shengjie Dec 21 '12 at 01:17
  • alternatively you can do the same thing with rowkeys. extract userID part from each rowkwy and put in the set. but there is no direct way to do this, as Arnon has said. one suggestion from my side, if you don't mind. Why don't you use Hive for these kind of things. It would be much much easier then. It'll save a lot of time and being similar to SQL it'll be more comfortable, in case you are coming from an SQL background. You can either store the data in Hive warehouse directly or map an existing Hbase table where you already have your data. – Tariq Dec 21 '12 at 10:22
  • I've tried running Hbase on Hive, the performance, as you might already know, it's poor since it's mapred underneath. I will need real time response, I am thinking to have a separated hbase table to just store users. But as I mentioned in the original question, Atomicity is a concern there. – Shengjie Dec 21 '12 at 10:30
  • 1
    You are right. But we can't say that Hive's performance is bad. It is meant for entirely different purpose, i.e "batch processing" and not for real time stuff. Same holds good for MR and Pig as well. If you really want this to be real time, you have to think of some alternative, like you have said above. – Tariq Dec 21 '12 at 10:47
  • This solution will be troublesome in the case of a huge data set, which is usually the case with HBase. – Apurv Nerlekar Feb 08 '17 at 01:13