I have a daily ingestion of data into HDFS. From that data I generate Hive tables partitioned by date and another column. One day is about 130 GB of data. After generating the data, I run msck repair. Each msck run now takes more than 2 hours. As I understand it, msck scans the whole table's data (we have about 200 days of data) and then updates the metadata. My question is: is there a way to make msck scan only the last day's data and then update the metadata, to speed up the whole process? By the way, there is no small-files issue; I already merge the small files before running msck.

leftjoin
Gary Wang

1 Answer

When you create an external table or repair/recover partitions with this configuration:

set hive.stats.autogather=true;

Hive scans each file in the table location to gather statistics, which can take a very long time.

The solution is to switch it off before running create table, alter table, or recover partitions:

set hive.stats.autogather=false; 

See these related tickets: HIVE-18743, HIVE-19489, HIVE-17478
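For the daily job in the question, the repair step could look roughly like the sketch below. The table name `mydb.events` is hypothetical; substitute your own. The setting applies only to the current session.

```sql
-- Sketch only: mydb.events is a hypothetical table name.
-- Disable automatic statistics gathering for this session,
-- so the repair does not read every file to compute stats.
set hive.stats.autogather=false;

-- Register any partition directories that are missing from the metastore.
MSCK REPAIR TABLE mydb.events;
```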

If you need statistics, you can gather statistics only for new partitions if necessary using

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  
  COMPUTE STATISTICS 

See details here: ANALYZE TABLE
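As an illustration, gathering statistics for just one newly loaded day might look like this. The table name `mydb.events` and the partition columns `dt` and `region` are assumptions standing in for the date column and the second partition column from the question.

```sql
-- Sketch only: table and column names are hypothetical.
-- Specifying dt but leaving region without a value analyzes
-- all region sub-partitions for that single day.
ANALYZE TABLE mydb.events PARTITION(dt='2019-06-01', region)
  COMPUTE STATISTICS;
```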

Also, if you know which partitions need to be added, use ALTER TABLE ADD PARTITION. You can add many partitions in a single command.
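A minimal sketch of adding one day's partitions in a single statement, which avoids msck entirely. The table name, the partition columns `dt` and `region`, their values, and the HDFS paths are all hypothetical; the LOCATION clauses can be dropped if the directories follow the table's default `dt=.../region=...` layout.

```sql
-- Sketch only: names, values, and paths are hypothetical.
-- IF NOT EXISTS makes the statement safe to re-run.
-- Multiple PARTITION clauses are separated by whitespace, not commas.
ALTER TABLE mydb.events ADD IF NOT EXISTS
  PARTITION (dt='2019-06-01', region='us')
    LOCATION '/data/events/dt=2019-06-01/region=us'
  PARTITION (dt='2019-06-01', region='eu')
    LOCATION '/data/events/dt=2019-06-01/region=eu';
```

Since the ingestion job already knows which date it just wrote, it can generate this statement for that day's partitions only.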

leftjoin