34

Is there a Hive query to quickly find table size (i.e. number of rows) without launching a time-consuming MapReduce job? (Which is why I want to avoid COUNT(*).)

I tried DESCRIBE EXTENDED, but that yielded numRows=0, which is obviously not correct.

(Apologies for the newb question. I tried Googling and searching the apache.org documentation without success.)

xenocyon

6 Answers

47

show tblproperties will give the size of the table, and can be used to grab just that value if needed.

-- gives all properties
show tblproperties yourTableName;

-- show just the raw data size
show tblproperties yourTableName("rawDataSize");
Jared
  • Apparently the given command only works when these properties are available for the table, which they are not by default. – Rodrigo Hjort Mar 16 '17 at 20:46
  • Yes, the output is in bytes. Also, this only works for non-partitioned tables which have had stats run on them. – Jared Apr 12 '17 at 13:25
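Since rawDataSize is reported in bytes, a quick conversion makes it readable; a minimal sketch, using the example rawDataSize value that appears in the ANALYZE output in the next answer (awk assumed available):

```shell
# rawDataSize is in bytes; convert an example value to GiB
bytes=58080809507
awk -v b="$bytes" 'BEGIN { printf "%.1f GiB\n", b / (1024 ^ 3) }'
# prints "54.1 GiB"
```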
31

Here is the quick command

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

For example, if the table is partitioned:

 hive> ANALYZE TABLE ops_bc_log PARTITION(day) COMPUTE STATISTICS noscan;

the output is:

Partition logdata.ops_bc_log{day=20140523} stats: [numFiles=37, numRows=26095186, totalSize=654249957, rawDataSize=58080809507]

Partition logdata.ops_bc_log{day=20140521} stats: [numFiles=30, numRows=21363807, totalSize=564014889, rawDataSize=47556570705]

Partition logdata.ops_bc_log{day=20140524} stats: [numFiles=35, numRows=25210367, totalSize=631424507, rawDataSize=56083164109]

Partition logdata.ops_bc_log{day=20140522} stats: [numFiles=37, numRows=26295075, totalSize=657113440, rawDataSize=58496087068]
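If you want a single total rather than per-partition figures, the numRows values above can be summed with a little shell; a sketch, assuming the ANALYZE output has been saved to a file (grep, cut, and awk assumed available):

```shell
# Save the ANALYZE TABLE output shown above to a file
cat > stats.txt <<'EOF'
Partition logdata.ops_bc_log{day=20140523} stats: [numFiles=37, numRows=26095186, totalSize=654249957, rawDataSize=58080809507]
Partition logdata.ops_bc_log{day=20140521} stats: [numFiles=30, numRows=21363807, totalSize=564014889, rawDataSize=47556570705]
Partition logdata.ops_bc_log{day=20140524} stats: [numFiles=35, numRows=25210367, totalSize=631424507, rawDataSize=56083164109]
Partition logdata.ops_bc_log{day=20140522} stats: [numFiles=37, numRows=26295075, totalSize=657113440, rawDataSize=58496087068]
EOF

# Extract each numRows value and total them for an overall row count
grep -o 'numRows=[0-9]*' stats.txt | cut -d= -f2 | awk '{ s += $1 } END { print s }'
# prints "98964435"
```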

sumitya
minhas23
  • Note, per the [documentation](https://cwiki.apache.org/confluence/display/Hive/StatsDev), the "noscan" parameter will prevent scanning the files to get a correct count of the number of rows. You will still get the total size, though. In your case, it might have worked because you were gathering stats by config. – Daniel Langdon Jun 27 '14 at 19:57
  • @DanielLangdon are the units of rawDataSize bytes, blocks or something else? – dlamblin Sep 16 '16 at 04:16
  • bytes @DanielLangdon – Bargitta May 20 '19 at 08:27
  • Sorry guys, I moved positions and companies a while ago and do not remember much details. – Daniel Langdon May 20 '19 at 20:41
11

How about using:

    hdfs dfs -du -s -h /path/to/table/name
user2720864
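On recent Hadoop versions, dropping the -h flag makes the first column a plain byte count, which is easier to feed into scripts; a sketch using a hypothetical captured output line (the path and sizes are illustrative, not from a real cluster):

```shell
# Hypothetical output line from: hdfs dfs -du -s /path/to/table/name
# First column: bytes stored; second: bytes consumed across all HDFS replicas
line="654249957  1962749871  /path/to/table/name"

size=$(echo "$line" | awk '{ print $1 }')
echo "$size"
# prints "654249957"
```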
1

A solution, though not a quick one:
If the table is partitioned, we can count the number of partitions and the number of rows in each partition.
For example, if partitioned by date (mm-dd-yyyy):

select <partition_column_name>, count(*) from <table_name> where <partition_column_name> >= '05-14-2018' group by <partition_column_name>;
src3369
-2

Store the data of your external/internal table in Parquet format; then you will get quicker results.

-6

It is a good question. count(*) will take a long time to find the result. But unfortunately, count(*) is the only way to do it.

There is an alternative way (not a true alternative, but with better latency than the above):

Set the property

set hive.exec.mode.local.auto=true;

and run the same command (select count(*) from tbl), which gives better latency than before.

Neels
vijay kumar