Apache Kylin: cube creation fails at step 5 - KeyValue size too large

Question

I have started using Apache Kylin (version 1.5.3). When creating a cube, I get an error at step 5, 'Save Cuboid Statistics'. The log says:

java.lang.IllegalArgumentException: KeyValue size too large
at org.apache.hadoop.hbase.client.HTable.validatePut(HTable.java:1521)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.validatePut(BufferedMutatorImpl.java:147)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.doMutate(BufferedMutatorImpl.java:134)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:98)
at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1038)
at org.apache.kylin.storage.hbase.HBaseResourceStore.putResourceImpl(HBaseResourceStore.java:242)
at org.apache.kylin.common.persistence.ResourceStore.putResource(ResourceStore.java:208)
at org.apache.kylin.engine.mr.steps.SaveStatisticsStep.doWork(SaveStatisticsStep.java:113)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:112)
at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:57)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:112)
at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:127)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

First I tried to create the same cube with fewer dimensions, and that works. Creating another cube with the left-out dimensions also works. But when I try to create one cube with all 13 dimensions, it fails. I also tried setting hbase.client.keyvalue.maxsize to 0 to disable the check, but I still get the same error.

Does anyone know what the problem is and how I can solve it?

I use Kylin on top of the HDP 2.4 Sandbox, by the way.

Thanks in advance for any help.

Søren

What's the "hbase.client.keyvalue.maxsize" in your hbase configuration? — Nithin, Aug 10 '16 at 21:10
"hbase.client.keyvalue.maxsize" is set to 0 atm. So normally the check should be disabled. — Søren, Aug 11 '16 at 09:40

Nithin · Answer 1 · 2016-08-16T19:40:14.647

Make sure that the value of kylin.hbase.client.keyvalue.maxsize (in the Kylin config file, conf/kylin.properties) and the value of hbase.client.keyvalue.maxsize (in the HBase config file) are the same. The "KeyValue size too large" error usually appears when kylin.hbase.client.keyvalue.maxsize is greater than hbase.client.keyvalue.maxsize.

Please find below a sample kylin.properties:

# kylin server's mode
kylin.server.mode=all

# optional information for the owner of kylin platform, it can be your team's email
# currently it will be attached to each kylin's htable attribute
kylin.owner=whoami@kylin.apache.org

# List of web servers in use, this enables one web server instance to sync up with other servers.
kylin.rest.servers=localhost:7070

# The metadata store in hbase
kylin.metadata.url=kylin_metadata@hbase

# The storage for final cube file in hbase
kylin.storage.url=hbase

# Temp folder in hdfs, make sure user has the right access to the hdfs directory
kylin.hdfs.working.dir=/kylin

# HBase Cluster FileSystem, which serving hbase, format as hdfs://hbase-cluster:8020
# leave empty if hbase running on same cluster with hive and mapreduce
kylin.hbase.cluster.fs=

kylin.job.mapreduce.default.reduce.input.mb=500

# max job retry on error, default 0: no retry
kylin.job.retry=0

# If true, job engine will not assume that hadoop CLI reside on the same server as it self
# you will have to specify kylin.job.remote.cli.hostname, kylin.job.remote.cli.username and kylin.job.remote.cli.password
# It should not be set to "true" unless you're NOT running Kylin.sh on a hadoop client machine 
# (Thus kylin instance has to ssh to another real hadoop client machine to execute hbase,hive,hadoop commands)
kylin.job.run.as.remote.cmd=false

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.hostname=

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.username=

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.password=

# Used by test cases to prepare synthetic data for sample cube
kylin.job.remote.cli.working.dir=/tmp/kylin

# Max count of concurrent jobs running
kylin.job.concurrent.max.limit=10

# Time interval to check hadoop job status
kylin.job.yarn.app.rest.check.interval.seconds=10

# Hive database name for putting the intermediate flat tables
kylin.job.hive.database.for.intermediatetable=default

#default compression codec for htable,snappy,lzo,gzip,lz4
kylin.hbase.default.compression.codec=snappy

#the percentage of the sampling, default 100%
kylin.job.cubing.inmem.sampling.percent=100

# The cut size for hbase region, in GB.
kylin.hbase.region.cut=5

# The hfile size of GB, smaller hfile leading to the converting hfile MR has more reducers and be faster
# set 0 to disable this optimization
kylin.hbase.hfile.size.gb=2

# Enable/disable ACL check for cube query
kylin.query.security.enabled=true

# whether get job status from resource manager with kerberos authentication
kylin.job.status.with.kerberos=false


## kylin security configurations

# spring security profile, options: testing, ldap, saml
# with "testing" profile, user can use pre-defined name/pwd like KYLIN/ADMIN to login
kylin.security.profile=testing

# default roles and admin roles in LDAP, for ldap and saml
acl.defaultRole=ROLE_ANALYST,ROLE_MODELER
acl.adminRole=ROLE_ADMIN

#LDAP authentication configuration
ldap.server=ldap://ldap_server:389
ldap.username=
ldap.password=

#LDAP user account directory; 
ldap.user.searchBase=
ldap.user.searchPattern=
ldap.user.groupSearchBase=

#LDAP service account directory
ldap.service.searchBase=
ldap.service.searchPattern=
ldap.service.groupSearchBase=

#SAML configurations for SSO
# SAML IDP metadata file location
saml.metadata.file=classpath:sso_metadata.xml
saml.metadata.entityBaseURL=https://hostname/kylin
saml.context.scheme=https
saml.context.serverName=hostname
saml.context.serverPort=443
saml.context.contextPath=/kylin


ganglia.group=
ganglia.port=8664

## Config for mail service

# If true, will send email notification;
mail.enabled=false
mail.host=
mail.username=
mail.password=
mail.sender=

###########################config info for web#######################

#help info ,format{name|displayName|link} ,optional
kylin.web.help.length=4
kylin.web.help.0=start|Getting Started|
kylin.web.help.1=odbc|ODBC Driver|
kylin.web.help.2=tableau|Tableau Guide|
kylin.web.help.3=onboard|Cube Design Tutorial|

#guide user how to build streaming cube
kylin.web.streaming.guide=http://kylin.apache.org/

#hadoop url link ,optional
kylin.web.hadoop=
#job diagnostic url link ,optional
kylin.web.diagnostic=
#contact mail on web page ,optional
kylin.web.contact_mail=

###########################config info for front#######################

#env DEV|QA|PROD
deploy.env=QA

###########################deprecated configs#######################
kylin.sandbox=true
kylin.web.hive.limit=20
# The cut size for hbase region,
#in GB.
# E.g, for cube whose capacity be marked as "SMALL", split region per 5GB by default
kylin.hbase.region.cut.small=5
kylin.hbase.region.cut.medium=10
kylin.hbase.region.cut.large=50
kylin.hbase.client.keyvalue.maxsize=1048576

In kylin.properties, set kylin.hbase.client.keyvalue.maxsize=1048576.
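To check the two settings against each other without reading both files by eye, a small script can help. The following is only a rough sketch, not part of Kylin or HBase: the helper names (read_properties, read_hbase_site, compare_maxsize) and the inline sample values are made up for illustration, and it assumes hbase-site.xml uses the usual configuration/property/name/value layout. The HBase value of 0 in the sample mirrors what the question describes.

```python
import xml.etree.ElementTree as ET

KYLIN_KEY = "kylin.hbase.client.keyvalue.maxsize"
HBASE_KEY = "hbase.client.keyvalue.maxsize"


def read_properties(text):
    """Parse simple key=value lines from Java-style .properties text."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def read_hbase_site(xml_text):
    """Collect name/value pairs from hbase-site.xml style text."""
    conf = {}
    for prop in ET.fromstring(xml_text).findall("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name.strip()] = (prop.findtext("value") or "").strip()
    return conf


def compare_maxsize(kylin_props, hbase_conf):
    """Return 'missing', 'match' or 'mismatch' for the two maxsize settings."""
    kylin_val = kylin_props.get(KYLIN_KEY)
    hbase_val = hbase_conf.get(HBASE_KEY)
    if kylin_val is None or hbase_val is None:
        return "missing"
    return "match" if int(kylin_val) == int(hbase_val) else "mismatch"


# Hypothetical sample contents; replace with the text of your own files.
KYLIN_SAMPLE = """
# kylin server's mode
kylin.server.mode=all
kylin.hbase.client.keyvalue.maxsize=1048576
"""

HBASE_SAMPLE = """<configuration>
  <property>
    <name>hbase.client.keyvalue.maxsize</name>
    <value>0</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    result = compare_maxsize(read_properties(KYLIN_SAMPLE),
                             read_hbase_site(HBASE_SAMPLE))
    print(result)
```

To run it against a real installation, read conf/kylin.properties and your hbase-site.xml into strings and pass them to the two readers; the location of hbase-site.xml depends on your distribution. A result of "missing" means at least one of the two keys is not set explicitly in the file that was read.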

score 0 · Answer 2 · answered Aug 11 '16 at 09:56

@Nithin K Anil

I can't find kylin.hbase.client.keyvalue.maxsize in kylin.properties. My kylin.properties looks like this:

[root@sandbox conf]# cat kylin.properties
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# kylin server's mode
kylin.server.mode=all

# optional information for the owner of kylin platform, it can be your team's email
# currently it will be attached to each kylin's htable attribute
kylin.owner=whoami@kylin.apache.org

# List of web servers in use, this enables one web server instance to sync up with other servers.
kylin.rest.servers=localhost:7070

# The metadata store in hbase
kylin.metadata.url=kylin_metadata@hbase

# The storage for final cube file in hbase
kylin.storage.url=hbase

# Temp folder in hdfs, make sure user has the right access to the hdfs directory
kylin.hdfs.working.dir=/kylin

# HBase Cluster FileSystem, which serving hbase, format as hdfs://hbase-cluster:8020
# leave empty if hbase running on same cluster with hive and mapreduce
kylin.hbase.cluster.fs=

kylin.job.mapreduce.default.reduce.input.mb=500

# max job retry on error, default 0: no retry
kylin.job.retry=0

# If true, job engine will not assume that hadoop CLI reside on the same server as it self
# you will have to specify kylin.job.remote.cli.hostname, kylin.job.remote.cli.username and kylin.job.remote.cli.password
# It should not be set to "true" unless you're NOT running Kylin.sh on a hadoop client machine
# (Thus kylin instance has to ssh to another real hadoop client machine to execute hbase,hive,hadoop commands)
kylin.job.run.as.remote.cmd=false

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.hostname=

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.username=

# Only necessary when kylin.job.run.as.remote.cmd=true
kylin.job.remote.cli.password=

# Used by test cases to prepare synthetic data for sample cube
kylin.job.remote.cli.working.dir=/tmp/kylin

# Max count of concurrent jobs running
kylin.job.concurrent.max.limit=10

# Time interval to check hadoop job status
kylin.job.yarn.app.rest.check.interval.seconds=10

# Hive database name for putting the intermediate flat tables
kylin.job.hive.database.for.intermediatetable=default

#default compression codec for htable,snappy,lzo,gzip,lz4
kylin.hbase.default.compression.codec=snappy

#the percentage of the sampling, default 100%
kylin.job.cubing.inmem.sampling.percent=100

# The cut size for hbase region, in GB.
kylin.hbase.region.cut=5

# The hfile size of GB, smaller hfile leading to the converting hfile MR has more reducers and be faster
# set 0 to disable this optimization
kylin.hbase.hfile.size.gb=2

# Enable/disable ACL check for cube query
kylin.query.security.enabled=true

# whether get job status from resource manager with kerberos authentication
kylin.job.status.with.kerberos=false


## kylin security configurations

# spring security profile, options: testing, ldap, saml
# with "testing" profile, user can use pre-defined name/pwd like KYLIN/ADMIN to login
kylin.security.profile=testing

# default roles and admin roles in LDAP, for ldap and saml
acl.defaultRole=ROLE_ANALYST,ROLE_MODELER
acl.adminRole=ROLE_ADMIN

#LDAP authentication configuration
ldap.server=ldap://ldap_server:389
ldap.username=
ldap.password=

#LDAP user account directory;
ldap.user.searchBase=
ldap.user.searchPattern=
ldap.user.groupSearchBase=

#LDAP service account directory
ldap.service.searchBase=
ldap.service.searchPattern=
ldap.service.groupSearchBase=

#SAML configurations for SSO
# SAML IDP metadata file location
saml.metadata.file=classpath:sso_metadata.xml
saml.metadata.entityBaseURL=https://hostname/kylin
saml.context.scheme=https
saml.context.serverName=hostname
saml.context.serverPort=443
saml.context.contextPath=/kylin


ganglia.group=
ganglia.port=8664

## Config for mail service

# If true, will send email notification;
mail.enabled=false
mail.host=
mail.username=
mail.password=
mail.sender=

###########################config info for web#######################

#help info ,format{name|displayName|link} ,optional
kylin.web.help.length=4
kylin.web.help.0=start|Getting Started|
kylin.web.help.1=odbc|ODBC Driver|
kylin.web.help.2=tableau|Tableau Guide|
kylin.web.help.3=onboard|Cube Design Tutorial|

#guide user how to build streaming cube
kylin.web.streaming.guide=http://kylin.apache.org/

#hadoop url link ,optional
kylin.web.hadoop=
#job diagnostic url link ,optional
kylin.web.diagnostic=
#contact mail on web page ,optional
kylin.web.contact_mail=

###########################config info for front#######################

#env DEV|QA|PROD
deploy.env=QA

###########################deprecated configs#######################
kylin.sandbox=true
kylin.web.hive.limit=20
# The cut size for hbase region,
#in GB.
# E.g, for cube whose capacity be marked as "SMALL", split region per 5GB by default
kylin.hbase.region.cut.small=5
kylin.hbase.region.cut.medium=10
kylin.hbase.region.cut.large=50

Set kylin.hbase.client.keyvalue.maxsize=1048576 in property file — Nithin, Aug 16 '16 at 19:41

score 0 · Answer 3 · answered Aug 11 '16 at 19:43

We have hit key limits at Splice Machine as well before...

Also remember that, per the KeyValue spec, the key is required to fit into a short; see KeyValue#getRowOffset().
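To make the "fit into a short" remark concrete, here is a minimal sketch. It assumes the limit meant here is a length stored in a signed 2-byte field, as Java's short is, which caps it at 32767 bytes. The function name and constant are invented for the example, and this may well be a different limit from the one hit in the question.

```python
import struct

# Largest value a signed 16-bit integer can hold (Java's Short.MAX_VALUE).
MAX_SHORT = 32767


def length_fits_in_short(key: bytes) -> bool:
    """Return True if len(key) can be written as a signed 2-byte integer."""
    try:
        # ">h" is a big-endian signed short; packing fails if out of range.
        struct.pack(">h", len(key))
    except struct.error:
        return False
    return True


if __name__ == "__main__":
    print(length_fits_in_short(b"x" * MAX_SHORT))
    print(length_fits_in_short(b"x" * (MAX_SHORT + 1)))
```

A key of exactly 32767 bytes still packs, while one byte more does not, so the two calls print True and False respectively.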

John Leach