Does msck repair table require hadoop/map-reduce?

Question

I'm looking to run Hive without bothering to run hadoop/map-reduce.

I want users to use hive just for metadata and to use spark, presto, etc for queries/execution.

I think this will generally work, but I'm concerned about a few administrative commands. Specifically, I need to know how msck repair table works.

Does this command require map-reduce to function, or does hive handle it in the metastore/etc?

score 1 · Accepted Answer · edited Jun 07 '19 at 20:46

Map Reduce binaries as such are not required for msck repair table.

Map Reduce (MR) is a concept for large scale computations in parallel.

Hive will use Map Reduce for query processing unless you use another execution engine, such as Spark, or a separate query engine such as Impala.

See https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_hive_troubleshooting.html#hive_msck_repair_table_best_practices.

In any event, using HDFS etc. implies a Hadoop installation, and you get all the MR goodies anyway.

You can of course run Spark without Hadoop. That said, some of its functionality relies on Hadoop binaries - e.g. Parquet.

EDIT - Pulling this in from the comments while accepting as it's very useful:

This linked answer goes into depth on how msck repair works under the covers and makes it clear that map-reduce is not triggered by it.

What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?
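
As a rough illustration of the traversal that linked answer describes (walk the sub-directories under the table directory, parse the directory names, check each partition against what is already registered, and keep only the missing ones), here is a minimal Python sketch. It is not Hive's actual code: the function name `find_missing_partitions`, the `known_partitions` set standing in for the metastore, and the local directory layout are all assumptions made for the example.

```python
import os
import tempfile


def find_missing_partitions(table_dir, known_partitions):
    """Return partition specs found on disk but absent from known_partitions.

    known_partitions is a hypothetical stand-in for the metastore: a set of
    tuples such as (("year", "2019"), ("month", "06")).
    """
    missing = []
    for root, dirs, _files in os.walk(table_dir):
        dirs.sort()  # sort in place so the traversal order is deterministic
        if dirs:
            continue  # only leaf directories are candidate partitions
        rel = os.path.relpath(root, table_dir)
        if rel == ".":
            continue  # table directory has no partition sub-directories
        spec = []
        for part in rel.split(os.sep):
            key, sep, value = part.partition("=")
            if not sep or not key or not value:
                spec = None  # not a key=value directory name, so skip it
                break
            spec.append((key, value))
        if spec is not None and tuple(spec) not in known_partitions:
            missing.append(tuple(spec))
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as table_dir:
        # Two partitions exist on disk, only one is "registered".
        for month in ("05", "06"):
            os.makedirs(os.path.join(table_dir, "year=2019", f"month={month}"))
        known = {(("year", "2019"), ("month", "05"))}
        print(find_missing_partitions(table_dir, known))
```

Everything in the sketch is a directory listing plus a set lookup, which fits the point above that no map-reduce job is involved. The cost grows with the number of directories to list, which matches the linked answer's explanation of why the command is slow.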

edited Jun 07 '19 at 20:46 by John Humphreys

answered Jun 06 '19 at 20:52 by thebluephantom

I understand map reduce and spark, etc quite well. I just didn’t understand if hive would use the execution engine to crawl directories for the command. Still not 100% sure; the link says the command is very expensive but doesn’t mention if it is done by the metastore or the execution engine. You’re saying it doesn’t require the execution engine though, right? – John Humphreys Jun 06 '19 at 21:14
JH, with your rating I would expect that indeed! But why would it? Quote : " Consider table with multiple partition keys (2-3 partition keys is common in practice). msck repair will have to do a full-tree traversal of all the sub-directories under the table directory, parse the file names, make sure that the file names are valid, check if the partition is already existing in the metastore and then add the only partitions which are not present in the metastore. ..." from https://stackoverflow.com/questions/53667639/what-does-msck-repair-table-do-behind-the-scenes-and-why-its-so-slow – thebluephantom Jun 06 '19 at 21:22
I hope u r convinced. – thebluephantom Jun 06 '19 at 21:31
Don't worry, I eventually come back to answers, don't have to chase :) . Just been a long work week. Your answer is correct but you didn't explain how msck repair table works (and why it does or doesn't need MR), you kind of just talked about execution engines. The link you gave in the comment is very useful and answers my question well though so I pulled that into your answer and accepted. Thanks! – John Humphreys Jun 07 '19 at 20:44
True, but if one considers partitioning in Hive it could only be as the link explains. Which, for posterity, I knew and, granted, I did not originally explain. Referring to others is a recognition and I marked them up in the past. Success – thebluephantom Jun 07 '19 at 20:50
