1

I am trying to do UNION ALL for three different tables with same DDL structure but the final output is generating zero rows. I have no clue with what's happening in the underlying execution. Could anybody share your thoughts on this? My sample Hive SQL goes as below. Thank you.

SET hive.execution.engine=tez;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.qubole.cleanup.partial.data.on.failure=true;
    SET hive.tez.container.size=8192;
    SET tez.task.resource.memory.mb=8192;
    SET tez.task.resource.cpu.vcores=2;
    SET hive.mapred.mode=nonstrict;
    SET hive.qubole.dynpart.use.prefix=true;
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled =true;
    SET hive.cbo.enable=true;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.column.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET mapred.reduce.tasks = -1;
    SET hive.auto.convert.join.noconditionaltask.size=2730;
    SET hive.auto.convert.join=true;
    SET hive.auto.convert.join.noconditionaltask=true;
    SET hive.auto.convert.join.noconditionaltask.size=405306368;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.column.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET mapreduce.job.reduce.slowstart.completedmaps=0.8;


    CREATE  TABLE IF NOT EXISTS X STORED AS PARQUET AS 
      SELECT a,
             b,
             c
        FROM A
      UNION ALL
      SELECT a,
             b,
             c
        FROM B
      UNION ALL
      SELECT a,
             b,
             c
        FROM C;

If I try to run the below query on Presto it shows that there is data.

SELECT COUNT(1) FROM 
(
          SELECT a,
                 b,
                 c
            FROM A
          UNION ALL
          SELECT a,
                 b,
                 c
            FROM B
          UNION ALL
          SELECT a,
                 b,
                 c
            FROM C 
)Z;
Teja
  • 13,214
  • 36
  • 93
  • 155

1 Answers1

0

UNION ALL when running on Tez runs in parallel and creates extra sub-directories in the table location (check what is inside table location). Try to add these configuration settings before reading the table to allow Hive to read subdirectories:

set hive.mapred.supports.subdirectories=true; 
set mapred.input.dir.recursive=true;

Your query is quite simple and runs on mappers-only each subquery writing it's own sub-directory and does not interfere with another one.

Alternatively you can force extra reducer stage by adding distribute by at the end or order by (will run slower), run UNION instead of UNION ALL, apply filter after union etc - it will create files in the table folder without subdirectories:

CREATE  TABLE IF NOT EXISTS X STORED AS PARQUET AS 
select * from 
(
      SELECT a,
             b,
             c
        FROM A
      UNION ALL
      SELECT a,
             b,
             c
        FROM B
      UNION ALL
      SELECT a,
             b,
             c
        FROM C
      )s distribute by a; --this will force reducer step
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • My underlying files are stored on S3. Would that cause any issue? – Teja Mar 05 '19 at 08:21
  • @Teja To avoid eventual consistency in S3 (read after delete) after re-loading this table, apply method which create less number of files. distribute by low cardinality column or for example `distribute by FLOOR(RAND()*100.0)%20` - it will create 20 equal files and will run 20 reducers in parallel on final reducer vertex. But if possible, apply first method with recursive irectories – leftjoin Mar 05 '19 at 08:28
  • What could be the issue leftjoin? I made sure DDL's are exactly the same and ran the above query. It produces zero rows. :( – Teja Mar 05 '19 at 08:38
  • How did you check the data was not loaded? I suppose select count(*) from table after load? – leftjoin Mar 05 '19 at 08:41
  • SELECT COUNT(*) FROM ( UNION ALL QUERY ) in hive returns > 0 rows. I have checked in both presto and hive. – Teja Mar 05 '19 at 08:47