We have a four-datanode cluster running CDH 5.0.2, installed through Cloudera Manager parcels. To import 13M user rows into HBase, we wrote a simple Python script and used the hadoop-streaming jar. It works as expected up to 100k rows; then, one after the other, all datanodes crash with the same message:

The health test result for REGION_SERVER_GC_DURATION  has become bad: 
Average time spent in garbage collection was 44.8 second(s) (74.60%) 
per minute over the previous 5 minute(s). 
Critical threshold: 60.00%.

Any attempt to solve the issue by following the advice found around the web (e.g. [1], [2], [3]) does not lead anywhere near a solution. "Playing" with the Java heap size is useless. The only thing that "solved" the situation was increasing the Garbage Collection Duration Monitoring Period for region servers from 5 minutes to 50 minutes. Arguably a dirty workaround.

We don't have the workforce to build a monitor for our GC usage right now. We eventually will, but I was wondering how importing 13M rows into HBase could possibly lead to a guaranteed crash of all region servers. Is there a clean solution?

Edit:

JVM Options on Datanodes are:

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled

Datanodes are physical machines running CentOS 6.5, each with 32 GB of RAM and one quad-core CPU at 2 GHz with 30 MB cache.

Below is an excerpt of the Python script we run. We fill two tables: one with a unique user ID as rowkey and a single column family with the users' info, and another whose rowkeys are any piece of info we might want to look users up by.

#!/usr/bin/env python2.7
import sys
import json
import logging
import happybase

connection = happybase.Connection(host=master_ip)  # master_ip defined elsewhere
hbase_main_table = connection.table('users_table')
hbase_index_table = connection.table('users_index_table')
header = ['ID', 'COL1', 'COL2', 'COL3', 'COL4']

for line in sys.stdin:
    l = line.replace('"', '').strip("\n").split("\t")
    if l[header.index("ID")] == "ID":
        # skip the header line
        continue
    for h in header[1:]:
        try:
            id = str(l[header.index("ID")])
            col = 'info:' + h.lower()
            val = l[header.index(h)].strip()
            # main table: one cell per column, keyed by user ID
            hbase_main_table.put(id, {
                    col: val
                    })
            indexed = ['COL3', 'COL4']
            for typ in indexed:
                idx = l[header.index(typ)].strip()
                if len(idx) == 0:
                    continue
                # index table: read the existing ID list, merge, write back
                row = hbase_index_table.row(idx)
                old_ids = row.get('d:s')
                if old_ids is not None:
                    ids = json.dumps(list(set(json.loads(old_ids)).union([id])))
                else:
                    ids = json.dumps([id])
                hbase_index_table.put(idx, {
                        'd:s': ids,
                        'd:t': typ,
                        'd:b': 'ame'
                        })
        except Exception:
            msg = 'ERROR ' + str(l[header.index("ID")])
            logging.info(msg, exc_info=True)
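
For what it's worth, the main-table writes could also be buffered through happybase's batch() API rather than issuing one put() per cell, as suggested in the comments. This is only a sketch under that assumption (the batch size of 1000 is an arbitrary guess), not the job we actually run:

#!/usr/bin/env python2.7
# Sketch: buffer main-table puts client-side and flush them in groups.
import sys
import happybase

connection = happybase.Connection(host=master_ip)  # master_ip defined elsewhere, as above
users_table = connection.table('users_table')
header = ['ID', 'COL1', 'COL2', 'COL3', 'COL4']

# batch() collects mutations and sends them every batch_size puts,
# plus once more when the with-block exits.
with users_table.batch(batch_size=1000) as batch:
    for line in sys.stdin:
        fields = line.replace('"', '').strip("\n").split("\t")
        if fields[header.index("ID")] == "ID":
            continue  # skip the header line
        user_id = str(fields[header.index("ID")])
        # one put per user with all columns, instead of one put per column
        batch.put(user_id, dict(
            ('info:' + h.lower(), fields[header.index(h)].strip())
            for h in header[1:]))

# The index table is harder to batch, since each key needs a read of the
# current 'd:s' value before the merged ID list can be written back.
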
Mario Alemi
  • What is your current GC setup (please list all JVM params), and hardware (CPU / memory per machine)? – spudone Jul 09 '14 at 23:58
  • Many times we get GC errors because we keep creating objects without consuming them. *The JVM is more like a flow of data.* You create objects and they in turn get collected by the GC. If you and the GC are in sync, there is no possibility of a GC OutOfMemory error. To answer your question specifically, I would suggest not reading the next row of data until you have cleared the previous one. Can you review your Python script to verify that it only reads as much data as it can process, and does not load the next row until the previous one is done? Are you using it in a multi-threaded way? Can you share your script? – user3657302 Jul 10 '14 at 20:03
  • I'm sorry, I just saw the comments! I updated the question. @user3657302 As far as I can see the Python script should not be the problem, but please let me know your thoughts. – Mario Alemi Jul 14 '14 at 08:54
  • see this thread http://stackoverflow.com/questions/10109572/gc-overhead-limit-exceeded-on-hadoop-20-datanode – Vikas Hardia Jul 17 '14 at 13:15
  • @VikasHardia thanks, but they are not related (apart from being about GC :) ). Changing the amount of heap does not solve the problem; for the moment, only changing the Monitoring Period does... – Mario Alemi Jul 17 '14 at 16:33
  • FYI the code indentation is wrong. – Daniel Darabos Jul 21 '14 at 10:49
  • I suspect you should try performing your inserts in batches of less then 100k records. – Elliott Frisch Jul 22 '14 at 01:03
  • And actually use happybase batching with `hbase_table.batch(batch_size=10000)`. – kichik Jul 22 '14 at 04:23
  • Thanks. Any source where it says one should insert in batches? – Mario Alemi Jul 22 '14 at 09:40
  • And how? hadoop-streaming cats the file to stdout, which is eventually piped into the stdin of the script. As far as I can see, it's hadoop-streaming's task to feed the script with batches of lines, not happybase's... – Mario Alemi Jul 22 '14 at 09:47

1 Answer

One of the major issues that a lot of people are running into these days is that the amount of RAM available to Java applications has exploded, but most of the information about tuning Java GC is based on experience from the 32-bit era.

I recently spent a good deal of time researching GC for large-heap situations in order to avoid the dreaded "long pause". I watched this excellent presentation several times, and finally GC and the issues I've faced with it started making more sense.

I don't know that much about Hadoop, but I think you may be running into a situation where your young generation is too small. Unfortunately, most information about JVM GC tuning fails to emphasize that the best place for your objects to be GC'd is the young generation: collecting dead objects there costs essentially nothing, because only live objects are copied. I won't go into the details (watch the presentation if you want to know), but if you don't have enough room in your young (new) generation, it fills up prematurely. This forces a collection, and some objects will be moved to the tenured (old) generation. Eventually the tenured generation fills up and needs to be collected too. If you have a lot of garbage in your tenured generation, this can be very slow, since the tenured collection algorithm is generally mark-sweep, which takes non-trivial time to collect garbage.
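
If you want to confirm that this is what is happening before changing anything, HotSpot can log it for you. Only a sketch (the log path is an arbitrary choice of mine), added to the region server JVM options:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/hbase/regionserver-gc.log

PrintTenuringDistribution in particular shows how quickly objects are being pushed out of the survivor spaces into the tenured generation.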

I think you are using HotSpot. Here's a good reference for the various GC arguments for HotSpot: JVM GC options

I would start by increasing the size of the young generation greatly. My assumption here is that a lot of short- to medium-lived objects are being created. What you want to avoid is having these promoted into the tenured generation, and the way you do that is to extend the time they spend in the young generation. To accomplish that, you can either increase its size (so it takes longer to fill up) or increase the tenuring threshold (essentially the number of young collections an object survives before promotion). The problem with the tenuring threshold is that it takes time to move the object around within the young generation. Increasing the size of the young generation is inefficient in terms of memory, but my guess is that you have plenty to spare.
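
Purely as an illustration (the sizes are guesses for a 32 GB machine, not something I have tested on your cluster), the relevant HotSpot flags look like:

-Xmn2g -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15

-Xmn fixes the young generation size, SurvivorRatio controls how the young generation is split between eden and the survivor spaces, and MaxTenuringThreshold is the number of young collections an object can survive before it is promoted.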

I've used this solution with caching servers, and I have minor collections in the > 100 ms range and infrequent (less than one a day) major collections generally under 0.5 s, with a heap around 4 GB. Our objects live either 5 minutes, 15 minutes, or 29 days.

Another thing you might want to consider is the G1 (garbage-first) collector, which was (relatively speaking) recently added to HotSpot.
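
Trying it is just a flag swap, replacing the ParNew/CMS options you listed (the pause target below is only an example value):

-XX:+UseG1GC -XX:MaxGCPauseMillis=200

MaxGCPauseMillis is a soft target that G1 tries to meet, not a guarantee.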

I'm interested in how well this advice works for you. Good luck.

James Watson
  • Thanks! The project is now frozen, but I'll try your suggestions as soon as I get back to it, and let you know.... – Mario Alemi Jan 28 '15 at 13:21