
I am trying to execute this query using Hive, but it takes forever to run, especially after reaching the reducer step. It says mappers: 451, reducers: 1.

create table mb.ref201501_nontarget as select * from adv.raf_201501  where target=0 limit 200000;

My motivation to change the query came from this answer:

Hive unable to manually set number of reducers

I tried changing the query to:

create table mb.ref201501_nontarget as select * from (select * from adv.raf_201501 limit 200000)  where target=0;

but it's throwing an error.

user2542275

3 Answers


This question is rather vague, but if you think the last query produces the proper result (note that it is not the same as the first one!), the following should do the trick:

create table mytmptbl as select * from advanl.raf_201501 limit 200000;
create table mbansa001c.ref201501_nontarget as select * from mytmptbl where target=0;

After which you probably want to delete the temporary table again.

Dennis Jaheruddin
  • Thanks for quick response, why you find the question vague? – user2542275 Jun 20 '16 at 15:07
  • @user2542275 Because your question has a title, and a body, but from a simple read it is not that clear how they connect. – Dennis Jaheruddin Jun 20 '16 at 15:10
  • This still does not seem to improve the run time much; it gets stuck at reduce 33% and is now running really slowly. I tried set mapred.reduce.tasks=120; but it's still assigning reducers: 1. How can I increase the reducers? – user2542275 Jun 20 '16 at 15:19
  • @user2542275 Which one gets stuck? The first query or the second one? Is your table by any chance very wide (100+ columns)? These are very simple queries that would normally not take more than a few minutes. – Dennis Jaheruddin Jun 20 '16 at 15:24
  • The first one, yes I have about 8000 columns and 27m records – user2542275 Jun 20 '16 at 15:28
  • @user2542275 In that case: are you sure your query is slow? How fast would you expect it to be? [Some things you could easily try are using tez execution, or ORC storage.](http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/). If that does not help, pig may be an option as well. – Dennis Jaheruddin Jun 20 '16 at 15:32
  • I have already been running the query on Tez for 2.5 hours. Since the same query executes in SAS in approximately an hour, I expected it to run in less time. – user2542275 Jun 20 '16 at 15:34
  • I think breaking down the query works, but I still have no clue how to increase the number of reducers. – user2542275 Jun 20 '16 at 15:37
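Regarding the reducer count raised in the comments, these are the standard Hive knobs for reducer parallelism (a sketch; exact property names and behavior depend on the Hive/Hadoop version). Note that a query ending in a global LIMIT or ORDER BY is still funneled through a single reducer regardless of these settings:

```sql
-- Ask for an explicit number of reducers (YARN/MR2 property name):
set mapreduce.job.reduces=120;
-- Or let Hive derive the reducer count from data volume:
set hive.exec.reducers.bytes.per.reducer=256000000;  -- one reducer per ~256 MB of input
set hive.exec.reducers.max=120;                      -- upper bound on derived count
```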

Hadoop is a framework for distributed computing. Some data processing actions are a good fit because they are "embarrassingly parallel". Some data processing actions are a bad fit because they cannot be distributed. Most real-life cases are somewhere in between.

I strongly suspect that what you want to do is get a sample of the raw data with approximately 200k items. But your query requires exactly 200k items.
The simplest way for Hive to do that is to run the WHERE clause in parallel (451 Mappers on 451+ file blocks), then dump all partial results into a single "sink" (1 Reducer) that lets the first 200k rows pass through and ignores the rest. But all records will be processed, even the ones to be ignored.

Bottom line: you have a very inefficient sampler, and the result will probably have a strong bias -- smaller file blocks will be Mapped faster and processed earlier by the Reducer, hence larger file blocks have almost no chance to be represented in the sample.

I guess you know how many records match the WHERE clause, so you would be better off with some kind of random sampling that retrieves approx. 500K or 1M records -- that can be done up front, inside each Mapper -- then a second query with the LIMIT if you really want an arbitrary number of records -- a single Reducer will be OK for this kind of smallish volume.
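The mapper-side sampling suggested above could be sketched like this (a hypothetical two-step version, reusing the table names from the question; the 0.04 fraction assumes ~27M matching rows and a ~1M-row intermediate sample):

```sql
-- Step 1: sample ~4% of matching rows inside each Mapper. rand() is
-- evaluated independently per row, so no Reducer is needed here.
create table mb.ref201501_sample as
select * from adv.raf_201501
where target = 0 and rand() < 0.04;

-- Step 2: trim to exactly 200k rows. The single Reducer now only
-- has to pass through ~1M rows instead of 27M.
create table mb.ref201501_nontarget as
select * from mb.ref201501_sample limit 200000;
```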

Samson Scharfrichter

OK, this is what worked for me; it now takes only 2-5 minutes for about 27m records:

create table mb.ref201501_nontarget as SELECT * FROM adv.raf_201501 TABLESAMPLE(0.02 PERCENT) where target=0;

When using LIMIT or rand(), Hive uses at least 1 reducer, and the process takes more than 2 hours, practically freezing at the 33% reduce step.

With TABLESAMPLE and no LIMIT, it assigned only 1 mapper and 0 reducers.
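For comparison, Hive supports two TABLESAMPLE forms (a sketch; the bucket count 135 is a hypothetical value chosen so that 27M / 135 ≈ 200k rows):

```sql
-- Block-level sampling (what worked above): very fast because only a few
-- HDFS blocks are read, but the sample is drawn at block granularity.
select * from adv.raf_201501 tablesample(0.02 percent) where target = 0;

-- Row-level alternative: bucket sampling on rand() spreads the sample
-- across all rows, at the cost of scanning the whole table.
select * from adv.raf_201501 tablesample(bucket 1 out of 135 on rand())
where target = 0;
```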

user2542275