0

TL;DR I can execute a Hive query using an INNER JOIN on a HBase table in the Hive (Hive Server 1) command line and return the correct rows. However, the same query on the Beeline (Hive Server 2) command line returns no rows. I can do an INNER JOIN fine on regular Hive on HDFS tables fine.

I've replicated it on the following MapR environments:

MapR version: 4.0.1.27334.GA -- Hive version: hive-0.13
MapR version: 5.1.0.37549.GA -- Hive version: hive-1.2.0

I made two HBase tables and populated them like so (Note that in MapR, we use partitions as the base name. If you want to replicate this on a non-MapR environment, remove the /app/my_partition/ part):

create '/app/my_partition/HiveParent', 'f'
create '/app/my_partition/HiveChild', 'f'

put '/app/my_partition/HiveParent', 'foo|a|', 'f:foo', 'a'
put '/app/my_partition/HiveParent', 'foo|b|', 'f:foo', 'b'

put '/app/my_partition/HiveChild', 'foo|a|1|', 'f:foo', 'a'
put '/app/my_partition/HiveChild', 'foo|a|1|', 'f:bar', '1'
put '/app/my_partition/HiveChild', 'foo|a|2|', 'f:foo', 'a'
put '/app/my_partition/HiveChild', 'foo|a|2|', 'f:bar', '2'

put '/app/my_partition/HiveChild', 'foo|b|1|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|b|1|', 'f:bar', '1'
put '/app/my_partition/HiveChild', 'foo|b|2|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|b|2|', 'f:bar', '2'

I created the Hive on HBase tables in the hive shell like so:

CREATE EXTERNAL TABLE HiveParent(rk string, foo string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'hbase.columns.mapping'='f:foo'
) TBLPROPERTIES (
  'hbase.table.name'='/app/my_partition/HiveParent'
);

CREATE EXTERNAL TABLE HiveChild(rk string, foo string, bar string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'hbase.columns.mapping'='f:foo,f:bar'
) TBLPROPERTIES (
  'hbase.table.name'='/app/my_partition/HiveChild'
);

All of the following queries work successfully in Hive (Hive Server 1) and Beeline (Hive Server 2):

SELECT * FROM HiveParent;
SELECT foo FROM HiveParent;
SELECT foo FROM HiveParent WHERE foo IN ('a', 'b');
SELECT * FROM HiveChild;
SELECT foo, bar FROM HiveChild;
SELECT foo, bar FROM HiveChild WHERE foo IN ('a', 'b'); 

These queries only return rows in Hive (Hive Server 1). However, they return 0 rows in Beeline (Hive Server 2):

SELECT * FROM HiveParent INNER JOIN HiveChild ON (HiveParent.foo = HiveChild.foo);

SELECT * FROM HiveParent, HiveChild WHERE HiveParent.foo = HiveChild.foo;

SELECT * FROM HiveChild WHERE HiveChild.foo IN (SELECT HiveParent.foo FROM HiveParent); 

Edit: I have cross posted this question to MapR's question and answer web site because this didn't generate any attention. I will post the answer back here if it gets one.

Matthew Moisen
  • 16,701
  • 27
  • 128
  • 231

2 Answers2

0

This is not an appropiate solution, but it works.

set hive.auto.convert.join = false

What does hive.auto.convert.join do?

If hive.auto.convert.join is set to true the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible

So, hive by default attempts to convert joins into map-side joins. A Map-side join is one in which the smaller of the two tables is fully loaded into the mappers memory and is much more efficient. By setting hive.auto.convert.join to false, we are disabling map-side joins and forcing all joins to be reduce-side joins, which are much more inefficient.

Matthew Moisen
  • 16,701
  • 27
  • 128
  • 231
0

We found this solution out incidentally but it might be worth looking into. On one environment, beeline was giving OutOfMemory errors and was hanging. The sysadmins increased the memory allocated to HS2, and after that the INNER JOIN worked fine. However, this is unsettling, because Beeline did not throw any errors but simply returned no rows.

Increasing the memory on the MapR5.1/Hive-1.2 environment from 4GB to 8GB worked. However, on the MapR4.0.1/Hive-0.13 environment, this did not work.

In hive-env.sh:

export HADOOP_HEAPSIZE=8000
Matthew Moisen
  • 16,701
  • 27
  • 128
  • 231