1

I have an external table in hive and pointing to HDFS location. By mistake I have ran the job to load the data into HDFS two times.

Even after deleting the duplicate file from HDFS hive is showing the data count two times(i.e. including deleted duplicate data file count).

select count(*) from tbl_name -- returns double time

But ,

select count(col_name) from tbl_name -- returns actual count.

Same table when I tried from Impala after

INVALIDATE METADATA

I could see only data count which is available in HDFS(not duplicate).

How can hive give count as double even after deleting from physical location(hdfs) , does it read from statistics?

Cœur
  • 37,241
  • 25
  • 195
  • 267
William R
  • 739
  • 2
  • 13
  • 34
  • That's interesting. I haven't ever had that happen in Hive but I will try it out in my system and see what happens. – Jared Aug 31 '16 at 18:26
  • Yeah i'm not having that issue. I took an external table which only had one csv file in it. Then copied that file so there were two files in the HDFS directory and the count showed double. Then deleted the copied file and the count returned to normal. – Jared Aug 31 '16 at 18:34
  • seems same sort of thread not sure about solution, http://stackoverflow.com/questions/24080276/wrong-result-for-count-in-hive-table , I don't have any stats dedicated DB ,Does it make any sense. – William R Aug 31 '16 at 19:40
  • Very strange , Hive CLI returns correct count , But Hue returns wrong count for count(*) – William R Aug 31 '16 at 20:00
  • FYI - I was running the queries through Hue. – Jared Sep 01 '16 at 00:48
  • It worked after analyze table tbl_name partition(a,b,c) compute statistics; – William R Sep 01 '16 at 08:20

1 Answers1

1

Hive is using statistics for computing cont(*). You deleted files manually (not using Hive) that is why the stats is wrong.

The solution is:

  1. to switch-off statistics usage in such cases:

    set hive.compute.query.using.stats=false;

  2. to analyze table as you mention in your comment:

    analyze table tbl_name partition(a,b,c) compute statistics;

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • yep, it looks like the config setting set hive.compute.query.using.stats=false; was set as a default in my environment, which is why i had no issues. – Jared Sep 01 '16 at 14:09