0

Considering a typical data warehouse scenario in hive with fact and dimension tables, say the fact table is split across multiple data nodes with partitions. While joining fact tables (which are partitioned) with dimensions (which are not partitioned), it seems logical to use Map joins as size of dimension tables are small & they are to be stored in-memory to efficiently join with fact data across all nodes.

But, few online resources suggest that for Map Joins to be performed on partitioned tables, the partitioned key on both the tables should be the same as join key.

So, this is the question I am looking answers for :
Can a partitioned table (fact) be MAP joined with an non-partitioned table (dimension)?

1 Answers1

0

The answer is - Yes

Map Join Operator

Demo

create table fact (rec_id int,dim_id int) partitioned by (dt date);
create table dim  (dim_id int,descr string);

explain
select  *
from    fact f join dim d
        on d.dim_id = f.dim_id

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: dim_id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 dim_id (type: int)
                  1 dim_id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: f
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: dim_id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 dim_id (type: int)
                  1 dim_id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col6, _col7
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: date), _col6 (type: int), _col7 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88