
I have a complex HQL script that comprises a number of table joins, unions, row_number, and grouping sets.

I know the ODS data involved in this task totals 40 GB, but I am confused about how to evaluate the number of mappers and reducers it will use, and how much CPU and memory (i.e., how many containers) it will cost.

Any help is appreciated.

user2894829
  • 1. Check the plan (EXPLAIN) and you will understand the execution DAG (a minimal EXPLAIN sketch follows after these comments). 2. The number of mappers or reducers on each vertex (in the case of Tez) depends on many factors, not only the data size; everything matters: storage type, query plan, cluster capacity, configuration settings, etc. – leftjoin Sep 09 '21 at 12:01
  • Apart from EXPLAIN (which I feel gives questionable output sometimes), you can run the query, monitor it, check all the I/O, CPU, and memory parameters, and judge based on that. But as @leftjoin said, it is very difficult to guess/set the number of mappers/reducers, so you may have to go with your gut and adjust if it doesn't work. – Koushik Roy Sep 09 '21 at 14:19
  • Actually, it is possible to more or less control the number of mappers/reducers depending on data size and the splittability of files, or even independently of that. What I mean is: looking at the query alone, one cannot say how many mappers/reducers will be used; too many factors are involved. See this answer about controlling the number of mappers/reducers (a settings sketch also follows below): https://stackoverflow.com/a/55449237/2700344 – leftjoin Sep 09 '21 at 14:54
  • Also, a query can trigger reducers depending on data values (DISTRIBUTE BY) and bytes per reducer. – leftjoin Sep 09 '21 at 14:57
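
A minimal sketch of the first suggestion, assuming the Tez execution engine: prefix the script (or a representative piece of it) with EXPLAIN to print the plan without running it. The table and column names below are hypothetical placeholders, not taken from the question.

```sql
-- EXPLAIN prints the execution plan; on Tez the output shows the DAG of
-- Map/Reducer vertices the query would use. Tables/columns are placeholders.
EXPLAIN
SELECT customer_id,
       order_dt,
       SUM(amount) AS total_amount
FROM ods_orders
GROUP BY customer_id, order_dt
GROUPING SETS ((customer_id, order_dt), (customer_id), ());
```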
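
And a sketch of the knobs the linked answer describes for controlling parallelism, again assuming Tez; the values here are illustrative only, not recommendations for a 40 GB job:

```sql
-- Mapper count on Tez is driven by split grouping (bytes per grouped split):
SET tez.grouping.min-size=16777216;    -- ~16 MB lower bound
SET tez.grouping.max-size=1073741824;  -- ~1 GB upper bound

-- Reducer count is estimated from input size unless fixed explicitly:
SET hive.exec.reducers.bytes.per.reducer=67108864;  -- target ~64 MB per reducer
SET hive.exec.reducers.max=1009;                    -- cap on the estimated count
-- SET mapreduce.job.reduces=100;                   -- or hard-code the count

-- Memory per task is set by the container size (MB):
SET hive.tez.container.size=4096;
```

Multiplying the expected number of concurrent tasks by the container size gives a rough upper bound on the memory the job can occupy at once.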

0 Answers