
A Spark job fetches data from HBase and ingests it into SnappyData 1.1.0. The Spark distribution packaged with SnappyData 1.1.0 is launched as a standalone cluster (SnappyData and Spark share the cluster), and jobs are submitted to Spark via the Spark REST API.

The SnappyData 1.1.0 cluster stays stable for a week or so. Once the number of column tables reaches 20-30, the ingestion job fails with the exception below. Total resource usage does not reach 50%. At peak, each table can be around 10 GB in size (1 billion rows, 25 columns).
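
For reference, the ingestion step follows roughly the pattern below. This is only a minimal sketch assuming the standard SnappySession API; the object name, table name, schema, HBase read and options are illustrative and not the actual job code.

    import org.apache.spark.sql.{SnappySession, SparkSession}

    object IngestFromHBase {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("HBaseToSnappyIngestion").getOrCreate()
        val snappy = new SnappySession(spark.sparkContext)

        // Hypothetical: DataFrame produced from an HBase scan (connector-specific options omitted).
        val hbaseDf = spark.read.format("org.apache.hadoop.hbase.spark").load()

        // Create a column table and ingest the DataFrame. The CREATE TABLE statement is the
        // step that eventually fails with the 40XL1 lock timeout shown below.
        snappy.sql(
          """CREATE TABLE IF NOT EXISTS activity_data (
            |  row_key STRING, col1 STRING, col2 DOUBLE
            |) USING column OPTIONS (BUCKETS '128')""".stripMargin)

        // Column order of the DataFrame must match the table definition for insertInto.
        hbaseDf.write.insertInto("activity_data")
      }
    }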

Exception details:

Caused by: java.sql.SQLException: (SQLState=40XL1 Severity=30000) (Server=sw4/10.49.2.117[1527] Thread=ThriftProcessor-57) A lock could not be obtained within the time requested
at io.snappydata.thrift.SnappyDataService$executeUpdate_result$executeUpdate_resultStandardScheme.read(SnappyDataService.java:8244)
at io.snappydata.thrift.SnappyDataService$executeUpdate_result$executeUpdate_resultStandardScheme.read(SnappyDataService.java:8221)
at io.snappydata.thrift.SnappyDataService$executeUpdate_result.read(SnappyDataService.java:8160)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at io.snappydata.thrift.SnappyDataService$Client.recv_executeUpdate(SnappyDataService.java:285)
at io.snappydata.thrift.SnappyDataService$Client.executeUpdate(SnappyDataService.java:269)
at io.snappydata.thrift.internal.ClientService.executeUpdate(ClientService.java:976)
at io.snappydata.thrift.internal.ClientStatement.executeUpdate(ClientStatement.java:687)
at io.snappydata.thrift.internal.ClientStatement.executeUpdate(ClientStatement.java:221)
at org.apache.spark.sql.sources.JdbcExtendedUtils$.executeUpdate(jdbcExtensions.scala:84)
at org.apache.spark.sql.execution.columnar.impl.BaseColumnFormatRelation.createActualTables(ColumnFormatRelation.scala:376)
at org.apache.spark.sql.sources.NativeTableRowLevelSecurityRelation$class.createTable(interfaces.scala:444)
at org.apache.spark.sql.execution.columnar.JDBCAppendableRelation.createTable(JDBCAppendableRelation.scala:46)
at org.apache.spark.sql.execution.columnar.impl.DefaultSource.createRelation(DefaultSource.scala:191)
at org.apache.spark.sql.execution.columnar.impl.DefaultSource.createRelation(DefaultSource.scala:71)
at org.apache.spark.sql.execution.columnar.impl.DefaultSource.createRelation(DefaultSource.scala:41)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:328)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.CodegenSparkFallback$$anonfun$doExecute$1.apply(CodegenSparkFallback.scala:175)
at org.apache.spark.sql.execution.CodegenSparkFallback$$anonfun$doExecute$1.apply(CodegenSparkFallback.scala:175)
at org.apache.spark.sql.execution.CodegenSparkFallback.executeWithFallback(CodegenSparkFallback.scala:113)
at org.apache.spark.sql.execution.CodegenSparkFallback.doExecute(CodegenSparkFallback.scala:175)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.SnappySession.createTableInternal(SnappySession.scala:1259)
at org.apache.spark.sql.SnappySession.createTable(SnappySession.scala:990)
at com.pw.smp.csa.SuspiciousActivityDetection$.runjob(SuspiciousActivityDetection.scala:318)
at com.pw.smp.csa.SuspiciousActivityDetection$.main(SuspiciousActivityDetection.scala:142)
at com.pw.smp.csa.SuspiciousActivityDetection.main(SuspiciousActivityDetection.scala)
... 6 more
Caused by: java.rmi.ServerException: Server STACK: java.sql.SQLTransactionRollbackException(40XL1): A lock could not be obtained within the time requested
at com.pivotal.gemfirexd.internal.iapi.error.StandardException.newException(StandardException.java:456)
at com.pivotal.gemfirexd.internal.engine.locks.GfxdLocalLockService.getLockTimeoutException(GfxdLocalLockService.java:295)
at com.pivotal.gemfirexd.internal.engine.locks.GfxdDRWLockService.getLockTimeoutException(GfxdDRWLockService.java:727)
at com.pivotal.gemfirexd.internal.engine.distributed.utils.GemFireXDUtils.lockObject(GemFireXDUtils.java:1350)
at com.pivotal.gemfirexd.internal.impl.sql.catalog.GfxdDataDictionary.lockForWriting(GfxdDataDictionary.java:632)
at com.pivotal.gemfirexd.internal.impl.sql.catalog.GfxdDataDictionary.startWriting(GfxdDataDictionary.java:562)
at com.pivotal.gemfirexd.internal.impl.sql.catalog.GfxdDataDictionary.startWriting(GfxdDataDictionary.java:507)
at com.pivotal.gemfirexd.internal.impl.sql.execute.CreateTableConstantAction.executeConstantAction(CreateTableConstantAction.java:297)
at com.pivotal.gemfirexd.internal.impl.sql.execute.MiscResultSet.open(MiscResultSet.java:64)
at com.pivotal.gemfirexd.internal.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:593)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:2179)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:1289)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:1006)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedStatement.executeUpdate(EmbedStatement.java:503)
at io.snappydata.thrift.server.SnappyDataServiceImpl.executeUpdate(SnappyDataServiceImpl.java:1794)
at io.snappydata.thrift.SnappyDataService$Processor$executeUpdate.getResult(SnappyDataService.java:1535)
at io.snappydata.thrift.SnappyDataService$Processor$executeUpdate.getResult(SnappyDataService.java:1519)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at io.snappydata.thrift.server.SnappyDataServiceImpl$Processor.process(SnappyDataServiceImpl.java:201)
at io.snappydata.thrift.server.SnappyThriftServerThreadPool$WorkerProcess.run(SnappyThriftServerThreadPool.java:270)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.snappydata.thrift.server.SnappyThriftServer$1.lambda$newThread$0(SnappyThriftServer.java:143)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.gemstone.gemfire.cache.LockTimeoutException: lock timeout for object: DefaultGfxdLockable@a534854:GfxdDataDictionary, for lock: GfxdReentrantReadWriteLock@77629235,QSync@3630b21a[name=GfxdDataDictionary] [readers=0], requested for owner: DistributedTXLockOwner(member=10.49.2.117(29205):5551,XID=2667,ownerThread=Thread[ThriftProcessor-57,5,SnappyThriftServer Threads],vmCreatorThread=Thread[ThriftProcessor-57,5,SnappyThriftServer Threads])
at com.pivotal.gemfirexd.internal.engine.locks.GfxdLocalLockService.getLockTimeoutRuntimeException(GfxdLocalLockService.java:290)
at com.pivotal.gemfirexd.internal.engine.locks.GfxdLocalLockService.getLockTimeoutException(GfxdLocalLockService.java:296)
... 22 more

at io.snappydata.thrift.common.ThriftExceptionUtil.newSQLException(ThriftExceptionUtil.java:109)
at io.snappydata.thrift.internal.ClientStatement.executeUpdate(ClientStatement.java:696)
... 42 more
  • Is your ingestion code doing continuous table drop/create? The lock timeout failure above is seen on a "create table", so it is not clear why this would be fired even after a week or so of operation. Normally table creation should be done once and ingestion will simply insert/put into the table. Please paste the relevant job code. – Sumedh Jul 19 '19 at 05:32
  • Nope. The workflow is as follows: a Spark job is launched which creates the table and pushes the data. Only once the ingestion process is completed do consumers start querying those tables. The DataFrame mechanism is used to create the table and ingest the data. Once a table is created it is never updated; the only operations that happen on a table are "select" and finally "drop". – Karthik S Raj Jul 19 '19 at 06:51
  • The lock timeout exception is being seen during CREATE TABLE. Since you mentioned that the cluster worked fine for a week or so, I am wondering why a CREATE TABLE would be fired after a week of operation. Normally all tables should be created once up front and only inserts/updates/deletes done on a continuous basis. In your case of ingestion it would be better to TRUNCATE and repopulate the table instead of dropping and creating it again (a rough sketch of this alternative follows these comments). – Sumedh Jul 20 '19 at 07:27
  • The table created is linked to the (logical) session the user works on; a session can be active for a week or more, hence the table cannot be re-used. A new table needs to be created whenever a new session is launched, and that table contains the rows for the session. Even assuming we re-use table names, this issue will still occur, as the number of tables is linked to the number of active sessions. – Karthik S Raj Jul 22 '19 at 06:22
  • A lock timeout in table creation on the DataDictionary usually means there is some other operation that is holding an open ResultSet on the table, or there is a DROP operation that is stuck for some reason. The logs should have the lock dumps and should indicate the threads holding the locks. You can post the logs on the Slack channel to get help on this. – Sumedh Jul 23 '19 at 08:34
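
A rough sketch of the two patterns discussed in the comments above, with illustrative table names and schema (not the actual job code):

    import org.apache.spark.sql.SnappySession

    // Current pattern: per-session DDL. Every CREATE TABLE / DROP TABLE takes a write
    // lock on the shared data dictionary (GfxdDataDictionary.lockForWriting in the trace).
    def perSessionTable(snappy: SnappySession, sessionId: String): Unit = {
      snappy.sql(s"CREATE TABLE session_data_$sessionId (id BIGINT, payload STRING) USING column")
      // ... ingest via DataFrame, serve SELECTs for the lifetime of the session ...
      snappy.sql(s"DROP TABLE IF EXISTS session_data_$sessionId")
    }

    // Suggested alternative: keep a fixed pool of tables created once up front and
    // TRUNCATE between sessions, so no DDL (and no dictionary write lock) is needed
    // on the hot path.
    def reuseTable(snappy: SnappySession, tableName: String): Unit = {
      snappy.sql(s"TRUNCATE TABLE $tableName")
      // ... repopulate for the next session ...
    }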

1 Answer


Looks like your app is trying to create a table while the data dictionary is locked. Is your app concurrently doing other work?
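
As a purely hypothetical sketch of that kind of interleaving over the thrift JDBC driver (host, table names and schema are made up, and whether this exact sequence times out depends on what is actually holding the dictionary lock in your cluster):

    import java.sql.DriverManager

    Class.forName("io.snappydata.jdbc.ClientDriver")
    val url = "jdbc:snappydata://locator-host:1527/" // hypothetical locator address

    // Connection A: some other work keeps a ResultSet open on an existing table
    // (or a DROP is stuck), which can keep the data dictionary locked.
    val connA = DriverManager.getConnection(url)
    val rs = connA.createStatement().executeQuery("SELECT * FROM session_data_123")
    rs.next() // result set intentionally left open

    // Connection B: the ingestion job issues CREATE TABLE, which needs a write lock
    // on GfxdDataDictionary; if it cannot get the lock within the lock-wait interval,
    // it fails with SQLState 40XL1, as in the trace above.
    val connB = DriverManager.getConnection(url)
    connB.createStatement().executeUpdate(
      "CREATE TABLE session_data_456 (id BIGINT, payload VARCHAR(200)) USING column")

    rs.close(); connA.close(); connB.close()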

jagsr
  • The Spark job that is launched creates the table and pushes the data. Only once the ingestion process is completed do consumers start querying those tables. The DataFrame mechanism is used to create and ingest the data. Once a table is created it is never updated; the only operations that happen on a table are select and finally drop. The above-mentioned process works fine for a week or so, then this happens. After that, a SnappyData restart is the only solution; until the restart is done, operations like table creation and row insertion are not allowed. After the restart things work fine for a day or two, and then the same problem occurs again. – Karthik S Raj Jul 19 '19 at 07:07