We have a four-datanode cluster running CDH 5.0.2, installed through Cloudera Manager parcels. To import 13M user rows into HBase, we wrote a simple Python script and ran it through the hadoop-streaming jar. It works as expected up to about 100k rows. Then, one after the other, all datanodes crash with the same message:
The health test result for REGION_SERVER_GC_DURATION has become bad:
Average time spent in garbage collection was 44.8 second(s) (74.60%)
per minute over the previous 5 minute(s).
Critical threshold: 60.00%.
Every attempt to solve the issue by following advice found around the web (e.g. [1], [2], [3]) has not led anywhere near a solution. "Playing" with the Java heap size is useless. The only thing that "solved" the situation was increasing the Garbage Collection Duration Monitoring Period for the region servers from 5 minutes to 50 minutes. Arguably a dirty workaround.
We don't have the manpower to build a monitor for our GC usage right now. We eventually will, but in the meantime I was wondering how importing 13M rows into HBase could possibly lead to a certain crash of all the region servers. Is there a clean solution?
Edit:
The JVM options on the datanodes are:
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
The datanodes are physical machines running CentOS 6.5, each with 32 GB of RAM and one quad-core CPU at 2 GHz with 30 MB of cache.
Below is an excerpt of the Python script we run. We fill two tables: one with the unique user ID as rowkey and a single column family holding the user's info, and another using every piece of info we might want to look users up by as the rowkey (a minimal sketch of the table schema follows the script).
#!/usr/bin/env python2.7
import sys
import json
import logging

import happybase

# Log to stderr (logging's default stream) so hadoop-streaming does not read
# the messages as mapper output on stdout.
logging.basicConfig(level=logging.INFO)

connection = happybase.Connection(host=master_ip)  # master_ip is set outside this excerpt
hbase_main_table = connection.table('users_table')
hbase_index_table = connection.table('users_index_table')
header = ['ID', 'COL1', 'COL2', 'COL3', 'COL4']

for line in sys.stdin:
    l = line.replace('"', '').strip("\n").split("\t")
    if l[header.index("ID")] == "ID":
        # you are reading the header line; skip it
        continue
    for h in header[1:]:
        try:
            id = str(l[header.index("ID")])
            col = 'info:' + h.lower()
            val = l[header.index(h)].strip()
            # main table: rowkey = user ID, one cell per column
            hbase_main_table.put(id, {
                col: val
            })
            # index table: rowkey = indexed value, d:s = JSON list of user IDs
            indexed = ['COL3', 'COL4']
            for typ in indexed:
                idx = l[header.index(typ)].strip()
                if len(idx) == 0:
                    continue
                row = hbase_index_table.row(idx)
                old_ids = row.get('d:s')
                if old_ids is not None:
                    ids = json.dumps(list(set(json.loads(old_ids)).union([id])))
                else:
                    ids = json.dumps([id])
                hbase_index_table.put(idx, {
                    'd:s': ids,
                    'd:t': typ,
                    'd:b': 'ame'
                })
        except Exception:
            msg = 'ERROR ' + str(l[header.index("ID")])
            logging.info(msg, exc_info=True)
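For completeness, the two tables are expected to already exist, each with a single column family: 'info' on users_table and 'd' on users_index_table. Here is a minimal sketch of creating them through happybase, assuming the same Thrift host (master_ip) as above and default column-family options (how we actually created them doesn't matter; the HBase shell works just as well):

#!/usr/bin/env python2.7
# Minimal sketch: create the two tables the mapper above writes to.
import happybase

connection = happybase.Connection(host=master_ip)
# users_table: rowkey = user ID, family 'info' holds one cell per column
connection.create_table('users_table', {'info': dict()})
# users_index_table: rowkey = indexed value, family 'd' holds the JSON list
# of user IDs ('d:s') plus the 'd:t' and 'd:b' cells written by the script
connection.create_table('users_index_table', {'d': dict()})

Each input row thus produces cells in users_table under info:*, and, for every non-empty COL3/COL4 value, a read-modify-write of the corresponding users_index_table row.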