3

How can I see from hive EXPLAIN is there a full table scan?

For example, is there a full scan? The table size is 993 rows.

The query is

explain select latitude,longitude FROM CRIMES WHERE geohash='dp3twhjuyutr'

I have secondary index on geohash column.

STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: crimes
                filterExpr: (geohash = 'dp3twhjuyutr') (type: boolean)
                Statistics: Num rows: 993 Data size: 265582 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (geohash = 'dp3twhjuyutr') (type: boolean)
                  Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: latitude (type: double), longitude (type: double)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Markiza
  • 444
  • 1
  • 5
  • 18

1 Answers1

3
  1. Absence of partition predicate in the plan means full scan. Of course this is not about predicate push-down in ORC.
  2. Check Data size and Num rows in each operator.
  3. EXPLAIN DEPENDENCY command will show all input_partitions collection and you can check what exactly will be scanned.
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Please, look at the edit pf the question. Thank you – Markiza May 30 '19 at 08:24
  • @Markiza Your index is not used in the plan. Also the table is too small and using Hive index is unnecessary overhead. Also Hive Indexes is a bit outdated concept. Do not use Hive Indexes, use ORC and partitions if applicable. ORC has built-in indexes and you do not need to manage them, it is done automatically in case of ORC. Currently you are using Sequence file format. Read this: https://community.hortonworks.com/articles/68631/optimizing-hive-queries-for-orc-formatted-tables.html – leftjoin May 30 '19 at 09:07
  • Thank you, my goal was just to to try index in work and check the execution plan, so in my plan it's a full scan, right? Yes, I know, indexes are removed in 3rd Hive. – Markiza May 30 '19 at 12:00
  • @Markiza Right. Yes it is. – leftjoin May 30 '19 at 12:09