8

I have a very large table in Hive, from which we need to load a subset of partitions. It looks something like this:

CREATE EXTERNAL TABLE table1 (
    col1 STRING
) PARTITIONED BY (p_key STRING);

I can load specific partitions like this:

SELECT * FROM table1 WHERE p_key = 'x';

with p_key being the key on which table1 is partitioned. If I hardcode it directly in the WHERE clause, it's all good. However, I have another query which calculates which partitions I need. It's more complicated than this, but let's define it simply as:

SELECT DISTINCT p_key FROM table2;

So now I should be able to construct a dirty query like this:

SELECT * FROM table1
WHERE p_key IN (SELECT DISTINCT p_key FROM table2);

Or written as an inner join:

SELECT t1.* FROM table1 t1
JOIN (SELECT DISTINCT p_key FROM table2) t2 ON t1.p_key = t2.p_key

However, when I run this, it takes enough time to let me believe it's doing a full table scan. In the explain for the above queries, I can also see the result of the DISTINCT operation are used in the reducer, not the mapper, meaning it would be impossible for the mapper to know which partitions should be loaded or not. Granted, I'm not fully familiar with Hive explain output, so I may be overlooking something.

I found this page: MapJoin and Partition Pruning on the Hive wiki and the corrosponding ticket indicates it was released in version 0.11.0. So I should have it.

Is it possible to do this? If so, how?

KennethJ
  • 924
  • 1
  • 7
  • 16

1 Answers1

0

I'm not sure how to help with MapJoin, but in the worst case you could dynamically create second query with something like:

SELECT concat('SELECT * FROM table1 WHERE p_key IN (',
              concat_ws(',',collect_set(p_key)),
              ')')
  FROM table2;

then execute obtained result. With this, query processor should be able to prune unneeded partitions.

Kombajn zbożowy
  • 8,755
  • 3
  • 28
  • 60