
I have a table in Hive.

When I run the command show tblproperties myTableName, it gives the result below:

numFiles        12
numRows         1688092
rawDataSize     934923162
totalSize       936611254

That means rawDataSize is 934.92 MB and totalSize is 936.61 MB.

I then ran a command to calculate the data size at the HDFS location of the same table:

[user@server1 ~]$ hdfs dfs -du -h -s /apps/hive/warehouse/test.db/myTableName
893.2 M  /apps/hive/warehouse/test.db/myTableName

The resulting data size is 893.2 MB.

There is a big difference between the two sizes reported for the same table. I am trying to understand why they differ and am looking for a detailed explanation.

Table Type - MANAGED_TABLE

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Sandeep Singh
  • All this formatting in the question for nothing :D – philantrovert Apr 20 '17 at 15:02
  • @philantrovert I should have done more research before asking this. :P I got confused by the result of an online converter, which showed decimal units by default. One good outcome for me, though, is that I learned the difference between `rawDataSize` and `totalSize` :) – Sandeep Singh Apr 21 '17 at 04:11

1 Answer


936611254 / 1024 / 1024 = 893.2 M
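The same arithmetic can be sketched in Python to show both conversions side by side. This is only an illustration: the helper names `to_decimal_mb` and `to_binary_mib` are made up for this example, and it assumes that `totalSize` is a byte count and that `hdfs dfs -du -h` reports sizes in binary units (1 M = 1024 × 1024 bytes).

```python
def to_decimal_mb(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1000 * 1000 bytes)."""
    return num_bytes / 1000 / 1000


def to_binary_mib(num_bytes: int) -> float:
    """Convert bytes to binary mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / 1024 / 1024


# totalSize as reported by `show tblproperties` in the question
total_size = 936611254

# What a decimal online converter shows
print(f"{to_decimal_mb(total_size):.2f} MB")   # 936.61 MB

# What a binary-unit tool shows
print(f"{to_binary_mib(total_size):.1f} MiB")  # 893.2 MiB
```

Both figures describe the same number of bytes; only the unit definition differs.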

Mike Gan
  • Thanks for the clarification. My bad, I was using an online converter to convert bytes to MB, and by default it showed the result in decimal units. The actual result here is in binary units. – Sandeep Singh Apr 19 '17 at 09:26
  • A more interesting question is why the data size increases after being stored in Hive. Its original (raw) size was `934923162`, and in HDFS it is reflected as `936611254`. – mangusta Jun 21 '18 at 03:06