5

WRT a 3 node cluster composed of c3.2xlarge instances.

I have two tables. Table U has about 65 Million records and contains among other fields latitude and longitude. Table L has about 1 million records and also contains latitude and longitude.

U is stored as an ORC table.

Task is to determine how many U records fall within a 10 mile radius of locations in L.

select l.id, count(u.id) from U u, L l where 3960 *
acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.long) -
radians(u.long)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0
group by l.id;

The bit 3960 * acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.long) - radians(u.long)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0 is just the distance between the lat/long pairs should be less than 10 miles constraint.

Problem: The query seems to take forever to complete. While the map phase completes relatively quickly the reduce phase gets stuck at some fixed percentage (80% ish)

I noticed this in the output messages that Hive emits. Number of reduce tasks determined at compile time: 1

I tried increasing the number of reducers by setting mapred.reduce.tasks to 7 but it always ends up as 1. I was not successful in increasing the number of reducers.

This answer seems to suggest that perhaps if I write my query in a different way then I can force more than 1 reducer. But I haven't been able to do that yet.

Execution Time estimates :For a single location in L it takes about 60 seconds to get an answer. By that account it should take 60 million seconds which is around 700 days! Should it take so much time? Even for Hadoop.

I have also tried putting addional constraints like bounding the lat, long in a 10 mile by 10 mile square box with the location in L at the centre of the box but the time taken is now 40 seconds for 1 location which is not a vast improvement.

Questions:

1) How can I force more number of reducers? 2) Is there a better(in terms of execution time) query? 3) Any other advice that will help me solve this problem.

Version: Hadoop - 2.7.0 Java 1.7.0_80 Hive 1.2.1

Community
  • 1
  • 1
awhan
  • 510
  • 6
  • 13
  • I think you should look at the GIS extensions for Hive: https://cwiki.apache.org/confluence/display/Hive/Spatial+queries. – Gordon Linoff Oct 19 '15 at 15:21
  • which version of Hadoop are you using? mapreduce.job.reduces replaces mapreduce.job.reduces in Hadoop 2.x – Ravindra babu Oct 19 '15 at 17:53
  • There is a cross join U x L, then filtering. Cross join multiplies rows. Is it possible to transform WHERE clause to join ON clause? Like this: FROM U u INNER JOIN L l ON (3960 * acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.long) - radians(u.long)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0) GROUP BY l.id – leftjoin Apr 04 '16 at 18:12

2 Answers2

1

As told earlier,

mapred.reduce.tasks replaced with mapreduce.job.reduces

Have a look at all deprecated attributes and new values for those deprecated values from this article.

Apart from deprecated attributes, you have to migrate some of the functions as well as explained in this article

Ravindra babu
  • 37,698
  • 11
  • 250
  • 211
0

If you're using Hadoop 2, the right way to specify reducers is:

SET mapreduce.job.reduces=7;

This is related to changes made in Hadoop 2. I had the same behavior you're describing, updated to the above command, and was successful in setting the reduce tasks. I'm not sure how Hadoop chooses the number of reducers necessary, but it does seem to go small in some situations, and that can cause significant performance lags.

economy
  • 4,035
  • 6
  • 29
  • 37