I have a hive efficiency question. I have 2 massive queries that need to be filtered, joined with mapping tables, and unioned. All the joins are identical for both tables. Would it be more efficient to union them before applying the joins to the combined table or to apply the joins to each massive query individually then union the results? Does it make a difference?
I tried the second way and the query ran for 24 hours before I killed it. I feel like I did everything I could to optimize it except potentially rearrange the union statement. On the one hand, I feel like it should not matter because the number or rows being joined by the mapping table is the same and since everything is palatalized, it should take roughly the same amount of time. On the other hand, maybe by doing the union first, it should guarantee that the two big queries are given full system resources before the joins are run. Then again, that might mean that there are only 2 jobs running at a time so the system is not being fully used or something.
I simply do not know enough about how hive and it's multi-threading works. Anybody have any ideas?