10

I have created a table in HIVE

CREATE TABLE IF NOT EXISTS daily_firstseen_analysis (
    firstSeen         STRING,
    category          STRING,
    circle            STRING,
    specId            STRING,
    language          STRING,
    osType            STRING,
    count             INT)
    PARTITIONED BY  (day STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS orc;

count(*) is not giving me correct result for this table

hive> select count(*) from daily_firstseen_analysis;
OK
75
Time taken: 0.922 seconds, Fetched: 1 row(s)

While the number of rows in this table is 959 rows

hive> select * from daily_firstseen_analysis;
....
Time taken: 0.966 seconds, Fetched: 959 row(s)

it gives data with 959 rows

hive> ANALYZE TABLE daily_firstseen_analysis PARTITION(day) COMPUTE STATISTICS noscan; 
    Partition logdata.daily_firstseen_analysis{day=20140521} stats: [numFiles=6, numRows=70, totalSize=4433, rawDataSize=37202]
    Partition logdata.daily_firstseen_analysis{day=20140525} stats: [numFiles=6, numRows=257, totalSize=4937, rawDataSize=136385]
    Partition logdata.daily_firstseen_analysis{day=20140523} stats: [numFiles=6, numRows=211, totalSize=5059, rawDataSize=112140]
    Partition logdata.daily_firstseen_analysis{day=20140524} stats: [numFiles=6, numRows=280, totalSize=5257, rawDataSize=148808]
    Partition logdata.daily_firstseen_analysis{day=20140522} stats: [numFiles=6, numRows=141, totalSize=4848, rawDataSize=74938]
    OK
    Time taken: 5.098 seconds

I am using hive with version Hive 0.13.0.2.1.2.0-402

NOTE: I found this issue in count(*) if We are inserting into a table in more than one time. Tables created with single insert do not have this issue

minhas23
  • 9,291
  • 3
  • 58
  • 40
  • Thats not possible. Can you make sure when you are inserting/loading record in Hive table at that time you are not running any other query on that table. – Mukesh S Jun 07 '14 at 19:06
  • @MukeshS I ran all insert queries sequentially.I can see all records with select * statement. Even analyze table give me all fine. Only issue is with count(*) on this table. count(*) on other tables works fine in my environment. Let me know if you need more information about this problem. – minhas23 Jun 09 '14 at 05:56
  • @MukeshS "That's not possible". So you are going to personally vouch that hive has zero bugs? BTW I have been using hive for two years in clusters up to over 500 nodes. I'd say hive has a bug or two.. – WestCoastProjects Jun 11 '14 at 07:34
  • If you run the ANALYZE command after you processed your multi inserts and run the count again the result should be ok? – Charles Feb 10 '15 at 09:33
  • @Charles ANALYZE is a good trick but I have the impression it doesn't work on EXTERNAL tables. Pretty bad bug IMHO – Andre de Miranda Jun 11 '15 at 18:11
  • @AndredeMiranda I don't see why this shouldn't work. At least the documentation doesn't say otherwise. – Charles Jun 15 '15 at 02:08
  • @Charles, that may be the case, still I got wrong results for COUNT when used together with JOIN. Bug is so underwhelming that Even considering moving to Drill and give Hive goodbye. – Andre de Miranda Jun 15 '15 at 10:38
  • @AndredeMiranda, you get the same result either you set tez or mr as the engine? – Charles Jun 16 '15 at 06:48

3 Answers3

12

I had the same problem, and using ANALYZE fixed it. Running these commands in order should give you the correct count:

hive> ANALYZE TABLE daily_firstseen_analysis PARTITION(day) COMPUTE STATISTICS;
hive> SELECT COUNT(*) FROM daily_firstseen_analysis;

i.e. you have to use the analyze command before the count. You have half the answer within your question.

Nadine
  • 1,620
  • 2
  • 15
  • 27
2

I execute ANALYZE TABLE ... at first is OK, but raise error when i try again.so i try:

hive> REFRESH TABLE daily_firstseen_analysis;
hive> SELECT COUNT(*) FROM daily_firstseen_analysis;

this is explain

enter image description here

Catarina Ferreira
  • 1,824
  • 5
  • 17
  • 26
libin
  • 420
  • 3
  • 7
1

if you have an external table, remove all the files in HDFS, and insert into the table again then select count(*) will be incorrect.

Karl Dailey
  • 118
  • 8