How does one run compute stats on a subset of columns from a hive table using Impala?

Question

I have a very long and wide hive table that is taking an exorbitant amount of time to return query results. As such, I attempted a 'compute stats' on the table but due to the width of the table, this operation often times out. As such, I was wondering if there is a way to run 'compute stats' on select columns? Documentation on the Cloudera website suggests that it is possible but the syntax does not work.

Here is what I've tried but to no avail. Please advise as these all result in syntax errors.

COMPUTE STATS database.table field1
COMPUTE STATS database.table field1, field2
COMPUTE STATS database.table (field1, field2)

score 1 · Accepted Answer · answered Jun 09 '20 at 21:06

1

After further research, it was confirmed that the syntax is, in fact, correct but the column list parameter for compute stats was not made available until CDH 5.15.x. I am leaving this here in case anybody comes across the same issue.

answered Jun 09 '20 at 21:06

OTM

186
3
14

score 1 · Answer 2 · answered Jun 10 '20 at 08:47

Hive works by creating jobs that run in a different engine (originally MapReduce, which can be rather slow) and the underlying engine can be changed. Rather than MapReduce, you may be able to use Apache Spark or Apache Tez, both of which are faster than MapReduce. Newer versions of Hive also support an architecture called LLAP (Live Long And Process) which caches metadata similarly to Impala, reducing query latency. You may want to test some typical queries against your own tables to see if one of these works better for you than Impala for interactive and ad-hoc queries.

UNDERSTANDING EXECUTIONS PLANS

To get a true grasp on what causes a query to take a long time, you need to understand what operations Hive or Impala will perform when it executes a query. To find this out, you can view the execution plan for a query. The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task.

To see an execution plan for a query, you can do this:

Prefix the query with the keyword EXPLAIN, then run it.

Execution plans can be long and complex. Fully understanding them requires a deep knowledge of MapReduce.

The execution plans provided by Hive and by Impala look slightly different, but at a basic level, they provide more or less the same information.

Hive explain plan understanding

TABLE AND COLUMNS STATISTICS

The SQL engines you use do a certain amount of optimizing of the queries on their own—they look for the best way to proceed with your query, when possible. When the query uses joins, the optimizers can do a better job when they have table statistics and column statistics. For the table as a whole, these statistics include the number of rows, the number of files used to store the data, and the total size of the data. The column statistics includes the approximate number of distinct values and the maximum and average sizes of the values (not the maximum or average value, but rather the size used in storage). The optimizers use this information when deciding how to perform the join tasks. Statistics also help your system prevent issues due to memory usage and resource limitations.

These statistics are not automatically calculated—you have to manually trigger it using a SQL command.

Once statistics are computed, both Hive and Impala can use them, though if you compute them in Hive, you need to refresh Impala's metadata cache. If you make any changes to the table, such as adding or deleting data, you'll need to recompute the statistics.

Both Hive and Impala can use the statistics, even when calculated by the other machine. However, when you have both Impala and Hive available, Cloudera recommends using Impala's COMPUTE STATS command to calculate and view the statistics. The method for Hive is a bit more difficult to use. If you do use Hive, you must refresh Impala's metadata cache for the table if you want Impala to use the statistics.

Statistics in Impala

Impala's syntax for calculating statistics for a table (including statistics for all columns) is COMPUTE STATS dbname.tablename; If the table is in the active database, you can omit dbname. from the command.

To see the statistics in Impala, run SHOW TABLE STATS dbname.tablename; or SHOW COLUMN STATS dbname.tablename;

Note: If the statistics have not yet been computed, #Rows for the table shows -1. The #Nulls statistics for each column will always be -1; old versions of Impala would calculate this statistic, but it is not used for optimization, so newer versions skip it.

Statistics in Hive

Hive's syntax for calculating statistics for a table is ANALYZE TABLE dbname.tablename COMPUTE STATISTICS; If the table is in the active database, you can omit dbname. from the command. To calculate column statistics, add FOR COLUMNS at the end of the command.

To see the table statistics in Hive, run DESCRIBE FORMATTED dbname.tablename; The Table Parameters section will include numFIles, numRows, rawDataSize, and totalSize. To see the statistics for a column, include the column name at the end: DESCRIBE FORMATTED dbname.tablename columnname; You can only display column statistics one column at a time.

How does one run compute stats on a subset of columns from a hive table using Impala?

2 Answers2