
I am trying to execute this query using Hive, but it takes forever to run, especially after reaching the reducer step. It says mappers: 451, reducers: 1.

create table mb.ref201501_nontarget as select * from adv.raf_201501  where target=0 limit 200000;

My motivation to change the query came from this answer:

Hive unable to manually set number of reducers

I tried changing the query to:

create table mb.ref201501_nontarget as select * from (select * from adv.raf_201501 limit 200000)  where target=0;

but it's throwing an error.

user2542275

3 Answers


This question is rather vague, but if you think the last query produces the proper result (note that it is not the same as the first one!), the following should do the trick:

create table mytmptbl as select * from advanl.raf_201501 limit 200000;
create table mbansa001c.ref201501_nontarget as select * from mytmptbl where target=0;

After which you probably want to delete the temporary table again.

Dennis Jaheruddin
  • Thanks for quick response, why you find the question vague? – user2542275 Jun 20 '16 at 15:07
  • @user2542275 Because your question has a title, and a body, but from a simple read it is not that clear how they connect. – Dennis Jaheruddin Jun 20 '16 at 15:10
  • This still does not seem to improve the run time much; it gets stuck at reduce 33% and is now running really slowly. I tried set mapred.reduce.tasks=120; but it's still assigning reducers: 1. How can I increase the reducers? – user2542275 Jun 20 '16 at 15:19
  • @user2542275 Which one gets stuck? The first query or the second one? Is your table by any chance very wide (100+ columns)? These are very simple queries that would normally not take more than a few minutes. – Dennis Jaheruddin Jun 20 '16 at 15:24
  • The first one, yes I have about 8000 columns and 27m records – user2542275 Jun 20 '16 at 15:28
  • @user2542275 In that case: are you sure your query is slow? How fast would you expect it to be? [Some things you could easily try are using tez execution, or ORC storage.](http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/). If that does not help, pig may be an option as well. – Dennis Jaheruddin Jun 20 '16 at 15:32
  • I have already been running the query on Tez for 2.5 hours. Since the same query executes in SAS in approximately an hour, I expected it to run in less time. – user2542275 Jun 20 '16 at 15:34
  • I think breaking down the query works, but I still have no clue how to increase the number of reducers. – user2542275 Jun 20 '16 at 15:37
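Regarding the reducer count raised in the comments, these are the standard Hive knobs for reducer parallelism (a sketch; exact property names and behavior depend on the Hive/Hadoop version). Note that a query ending in a global LIMIT or ORDER BY is still funneled through a single reducer regardless of these settings:

```sql
-- Ask for an explicit number of reducers (YARN/MR2 property name):
set mapreduce.job.reduces=120;
-- Or let Hive derive the reducer count from data volume:
set hive.exec.reducers.bytes.per.reducer=256000000;  -- one reducer per ~256 MB of input
set hive.exec.reducers.max=120;                      -- upper bound on derived count
```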

Hadoop is a framework for distributed computing. Some data processing actions are a good fit because they are "embarrassingly parallel". Some data processing actions are a bad fit because they cannot be distributed. Most real-life cases are somewhere in between.

I strongly suspect that what you want to do is get a sample of the raw data with approximately 200k items. But your query requires exactly 200k items.
The simplest way for Hive to do that is to run the WHERE clause in parallel (451 Mappers on 451+ file blocks), then dump all partial results into a single "sink" (1 Reducer) that lets the first 200k rows pass through and ignores the rest. But all records will be processed, even the ones to be ignored.

Bottom line: you have a very inefficient sampler, and the result will probably have a strong bias -- smaller file blocks will be Mapped faster and processed earlier by the Reducer, hence larger file blocks have almost no chance to be represented in the sample.

I guess you know how many records match the WHERE clause, so you would be better off with some kind of random sampling that retrieves approx. 500K or 1M records -- that can be done up front, inside each Mapper -- then a second query with the LIMIT if you really want an arbitrary number of records -- a single Reducer will be OK for this kind of smallish volume.
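The mapper-side sampling suggested above could be sketched like this (a hypothetical two-step version, reusing the table names from the question; the 0.04 fraction assumes ~27M matching rows and a ~1M-row intermediate sample):

```sql
-- Step 1: sample ~4% of matching rows inside each Mapper. rand() is
-- evaluated independently per row, so no Reducer is needed here.
create table mb.ref201501_sample as
select * from adv.raf_201501
where target = 0 and rand() < 0.04;

-- Step 2: trim to exactly 200k rows. The single Reducer now only
-- has to pass through ~1M rows instead of 27M.
create table mb.ref201501_nontarget as
select * from mb.ref201501_sample limit 200000;
```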

Samson Scharfrichter

OK, this is what worked for me; it now takes only 2-5 minutes for about 27m records:

create table mb.ref201501_nontarget as SELECT * FROM adv.raf_201501 TABLESAMPLE(0.02 PERCENT) where target=0;

When using LIMIT or rand(), Hive uses at least 1 reducer, and the process takes more than 2 hours, practically freezing at the 33% reduce step.

With TABLESAMPLE and no LIMIT, it assigned only 1 mapper and 0 reducers.
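For comparison, Hive supports two TABLESAMPLE forms (a sketch; the bucket count 135 is a hypothetical value chosen so that 27M / 135 ≈ 200k rows):

```sql
-- Block-level sampling (what worked above): very fast because only a few
-- HDFS blocks are read, but the sample is drawn at block granularity.
select * from adv.raf_201501 tablesample(0.02 percent) where target = 0;

-- Row-level alternative: bucket sampling on rand() spreads the sample
-- across all rows, at the cost of scanning the whole table.
select * from adv.raf_201501 tablesample(bucket 1 out of 135 on rand())
where target = 0;
```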

user2542275