
I am new to Hive and Hadoop and just created a table (ORC file format) in Hive. I am now trying to create indexes on my Hive table (a bitmap index). Every time I run the index build query, Hive starts a MapReduce job to build the index. At some point the MapReduce job just hangs and one of my nodes fails (a randomly different node across retries, so it's probably not the node itself). I tried increasing mapreduce.child.java.opts to 2048 MB, but that gave me errors about using more memory than available, so I increased mapreduce.map.memory.mb and mapreduce.reduce.memory.mb to 8 GB. All other configurations are left at their defaults.
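
Roughly, the statements involved look like the sketch below; the table, column, and index names are placeholders, and the `SET` lines only show which properties I changed, not necessarily how I applied them:

```sql
-- Index build, sketched with placeholder names (not my real schema)
CREATE INDEX my_table_bitmap_idx
  ON TABLE my_table (some_column)
  AS 'BITMAP'
  WITH DEFERRED REBUILD;

-- Memory settings I have tried
SET mapreduce.map.memory.mb=8192;       -- container size for map tasks
SET mapreduce.reduce.memory.mb=8192;    -- container size for reduce tasks
SET mapreduce.map.java.opts=-Xmx2048m;  -- child JVM heap (the 2048 MB I originally set via mapreduce.child.java.opts)

-- The MapReduce job launched by this step is the one that hangs
ALTER INDEX my_table_bitmap_idx ON my_table REBUILD;
```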

Any help with what configurations I am missing would be really appreciated.

Just for context, I am trying to index a table with 2.4 billion rows, which is 450 GB in size and has 3 partitions.

Vineet Goel

1 Answer


First, please confirm whether the indexing works on data at a small scale. Assuming it does, the way Hive runs its MapReduce jobs depends on several things:

1. The type of query (e.g. `count(*)` versus a plain `SELECT *`).
2. The amount of data each reducer is expected to process during the execution phase (controlled by the `hive.exec.reducers.bytes.per.reducer` property, which in turn determines how many reducers are spawned); a sketch of tuning this follows below.
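
For illustration, a rough sketch of how that property is usually adjusted in a Hive session; the exact value is something you tune for your data, and the default values mentioned here vary by Hive version:

```sql
-- Hive estimates the number of reducers roughly as:
--   reducers ≈ total input bytes / hive.exec.reducers.bytes.per.reducer
-- Lowering this value spawns more reducers, each handling less data.
SET hive.exec.reducers.bytes.per.reducer=256000000;  -- e.g. 256 MB per reducer instead of the older 1 GB default

-- Optionally cap the total number of reducers as well
SET hive.exec.reducers.max=999;
```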

In your case it is likely the second point. Given the scale at which you are running your program, please calculate the memory requirements accordingly; a rough back-of-the-envelope calculation follows below. This post has more information. Happy learning and coding.
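
As a back-of-the-envelope sketch using the numbers from your question (450 GB of input and the old 1 GB-per-reducer default); the actual counts depend on your ORC stripe and split sizes, so treat this only as an order-of-magnitude estimate:

```sql
-- Estimated reducer count with the defaults:
--   input size                           ≈ 450 GB ≈ 450,000,000,000 bytes
--   hive.exec.reducers.bytes.per.reducer ≈ 1,000,000,000 bytes (old default)
--   estimated reducers                   ≈ 450,000,000,000 / 1,000,000,000 ≈ 450
--
-- Each reducer runs in a YARN container of mapreduce.reduce.memory.mb (8 GB in your
-- setup), so the reducers that run concurrently on a node must fit within that
-- node's yarn.nodemanager.resource.memory-mb.
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- the value assumed in the estimate above
```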

Ramzy
  • Looks like, since I am not changing that value at all, `hive.exec.reducers.bytes.per.reducer` should be at its default of 1GB. Shouldn't that be fine, given that 1GB is totally manageable with my YARN MapReduce configurations? It should just spawn a lot more reducers. Do you suggest decreasing the number of bytes allocated to every reducer? I am sorry, but I couldn't really understand how to calculate the memory requirements from the post you linked to. – Vineet Goel Jun 09 '15 at 17:51
  • When you say that the map reduce job hangs, there may be other reasons. If that problem still persists, use debug statements (maybe with log4j or others) to identify which line of code is having an issue. And regarding memory considerations, I was referring to your 2.4 billion row execution. Once the program runs at small scale, then go higher, taking memory into account. I know it's difficult to analyze beforehand; please move ahead step by step. – Ramzy Jun 09 '15 at 18:30