
If one runs the DESCRIBE EXTENDED command on any Hive table, the output includes totalSize and rawDataSize values near the end.

What do these fields mean?

Example:

hive> DESCRIBE EXTENDED <TableName>;

Output Results:

Table(tableName:TablenameXXXXX, dbName:XXxXXX,
..........       .......................
numRows=116429472, totalSize=3835205544, rawDataSize=35040221600})
Henin RK

3 Answers


rawDataSize is the size of the original data set; totalSize is the amount of storage it takes on disk. This applies to the ORC file format: because ORC compresses the data, totalSize will be smaller than rawDataSize.
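As a rough illustration, the figures from the question can be compared directly. This is only a sketch: the variable names are made up here, and since the question does not state the table's file format, reading the gap as a compression effect is an assumption.

```python
# Figures copied from the DESCRIBE EXTENDED output in the question.
num_rows = 116_429_472
total_size = 3_835_205_544       # bytes occupied on disk
raw_data_size = 35_040_221_600   # bytes of the original (uncompressed) data

# Ratio of raw size to on-disk size (assumed here to reflect compression).
ratio = raw_data_size / total_size

# Average bytes per row, raw versus on disk.
raw_bytes_per_row = raw_data_size / num_rows
disk_bytes_per_row = total_size / num_rows

print(f"raw/disk ratio:     {ratio:.2f}")
print(f"raw bytes per row:  {raw_bytes_per_row:.1f}")
print(f"disk bytes per row: {disk_bytes_per_row:.1f}")
```

For this table the raw data comes out roughly nine times larger than what is stored on disk.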

Durga Viswanath Gadiraju
  • Does totalSize reflect only the space actually occupied by data, or does it include both the data and any unused portion of an HDFS block? – Henin RK Jan 06 '16 at 09:35
  • Is the totalSize a multiple of the HDFS block size? – Henin RK Jan 06 '16 at 09:37
  • It need not be a multiple of the block size. HDFS does not waste storage on the last block of a file. If the file size is 200 MB and the block size is 128 MB, the first block will be 128 MB and the second will be 72 MB – Durga Viswanath Gadiraju Jan 06 '16 at 09:53
  • @Neal: totalSize is in bytes according to this [wiki](https://cwiki.apache.org/confluence/display/Hive/StatsDev) – Vincent Lous Feb 09 '17 at 20:52
  • In the case of the `Parquet` format, `rawDataSize` is still smaller than `totalSize`. How is that possible? Parquet should compress the original data. I'm totally confused – mangusta Jun 21 '18 at 02:57
  • @DurgaViswanathGadiraju what is the unit of rawDataSize here, bits or bytes? – Indrajeet Gour Oct 12 '18 at 06:52

The meaning of the fields is:

  • totalSize - the total size in bytes of the physical files on disk where table data is stored.
  • rawDataSize - the sum of the datatype sizes of the columns multiplied by the number of rows in the table. It is also used as an estimate by the query optimizer (e.g. to determine whether a table is small enough for a map join instead of a regular join).
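
A minimal sketch of that calculation, under loudly stated assumptions: the per-type byte sizes, the example schema, the average string length, and the threshold below are all hypothetical values chosen for illustration. They are not Hive's exact accounting or its actual configuration defaults.

```python
# Hypothetical per-value sizes in bytes (illustrative assumptions only).
ASSUMED_TYPE_SIZES = {"int": 4, "bigint": 8, "double": 8, "boolean": 1}


def estimate_raw_data_size(column_types, num_rows, avg_string_len=0):
    """Sum the assumed per-column sizes, then multiply by the row count."""
    row_size = 0
    for col_type in column_types:
        if col_type == "string":
            # Variable-width column: fall back to an assumed average length.
            row_size += avg_string_len
        else:
            row_size += ASSUMED_TYPE_SIZES[col_type]
    return row_size * num_rows


# Hypothetical table: (id bigint, qty int, price double, name string).
schema = ["bigint", "int", "double", "string"]
estimate = estimate_raw_data_size(schema, num_rows=1_000_000, avg_string_len=20)

# Hypothetical cutoff standing in for a "small enough to map join" check.
SMALL_TABLE_THRESHOLD = 25_000_000
fits_in_map_join = estimate <= SMALL_TABLE_THRESHOLD

print(estimate, fits_in_map_join)
```

The point of the sketch is only that the estimate depends on the schema and row count, not on how the files happen to be encoded on disk.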
Eugen

The size of data is described by two statistics:

  • totalSize — Approximate size of data on disk
  • rawDataSize — Approximate size of data in memory

Hive on MapReduce uses totalSize. When both are available, Hive on Spark uses rawDataSize. Because of compression and serialization, a large difference between totalSize and rawDataSize can occur for the same dataset.

Leonel Atencio