TL;DR I can execute a Hive query using an INNER JOIN on a HBase table in the Hive (Hive Server 1) command line and return the correct rows. However, the same query on the Beeline (Hive Server 2) command line returns no rows. I can do an INNER JOIN fine on regular Hive on HDFS tables fine.
I've replicated it on the following MapR environments:
MapR version: 4.0.1.27334.GA -- Hive version: hive-0.13
MapR version: 5.1.0.37549.GA -- Hive version: hive-1.2.0
I made two HBase tables and populated them like so (Note that in MapR, we use partitions as the base name. If you want to replicate this on a non-MapR environment, remove the /app/my_partition/
part):
create '/app/my_partition/HiveParent', 'f'
create '/app/my_partition/HiveChild', 'f'
put '/app/my_partition/HiveParent', 'foo|a|', 'f:foo', 'a'
put '/app/my_partition/HiveParent', 'foo|b|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|a|1|', 'f:foo', 'a'
put '/app/my_partition/HiveChild', 'foo|a|1|', 'f:bar', '1'
put '/app/my_partition/HiveChild', 'foo|a|2|', 'f:foo', 'a'
put '/app/my_partition/HiveChild', 'foo|a|2|', 'f:bar', '2'
put '/app/my_partition/HiveChild', 'foo|b|1|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|b|1|', 'f:bar', '1'
put '/app/my_partition/HiveChild', 'foo|b|2|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|b|2|', 'f:bar', '2'
I created the Hive on HBase tables in the hive shell like so:
CREATE EXTERNAL TABLE HiveParent(rk string, foo string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'hbase.columns.mapping'='f:foo'
) TBLPROPERTIES (
'hbase.table.name'='/app/my_partition/HiveParent'
);
CREATE EXTERNAL TABLE HiveChild(rk string, foo string, bar string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'hbase.columns.mapping'='f:foo,f:bar'
) TBLPROPERTIES (
'hbase.table.name'='/app/my_partition/HiveChild'
);
All of the following queries work successfully in Hive (Hive Server 1) and Beeline (Hive Server 2):
SELECT * FROM HiveParent;
SELECT foo FROM HiveParent;
SELECT foo FROM HiveParent WHERE foo IN ('a', 'b');
SELECT * FROM HiveChild;
SELECT foo, bar FROM HiveChild;
SELECT foo, bar FROM HiveChild WHERE foo IN ('a', 'b');
These queries only return rows in Hive (Hive Server 1). However, they return 0 rows in Beeline (Hive Server 2):
SELECT * FROM HiveParent INNER JOIN HiveChild ON (HiveParent.foo = HiveChild.foo);
SELECT * FROM HiveParent, HiveChild WHERE HiveParent.foo = HiveChild.foo;
SELECT * FROM HiveChild WHERE HiveChild.foo IN (SELECT HiveParent.foo FROM HiveParent);
Edit: I have cross posted this question to MapR's question and answer web site because this didn't generate any attention. I will post the answer back here if it gets one.