
I'm trying to run a jar file from the SnappyData CLI.

I just want to create a SparkSession and a SnappySession at the beginning.

package io.test

import org.apache.spark.sql.{SnappySession, SparkSession}

object snappyTest {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .appName("SparkApp")
      .master("local")
      .getOrCreate()

    val snappy = new SnappySession(spark.sparkContext)
  }
}

From the sbt file:

name := "SnappyPoc"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0"

When I debug the code in the IDE, it works fine, but when I create a jar file and try to run it directly on SnappyData I get the message:

"message": "Ask timed out on [Actor[akka://SnappyLeadJobServer/user/context-supervisor/snappyContext1508488669865777900#1900831413]] after [10000 ms]",
"errorClass": "akka.pattern.AskTimeoutException",

I have Spark Standalone 2.1.1 and SnappyData 1.0.0. I added the dependencies to the Spark instance.

Could you help me? Thanks in advance.

Tomtom

2 Answers


The difference between "embedded" mode and "smart connector" mode needs to be explained first.

Normally, when you run a job using spark-submit, it spawns a set of new executor JVMs, as per the configuration, to run the code. However, in the embedded mode of SnappyData, the nodes hosting the data also host long-running Spark executors themselves. This is done to minimize data movement (i.e., move execution rather than data). For that mode you can submit a job (using snappy-job.sh) which will run the code on those pre-existing executors. Alternative routes include JDBC/ODBC for embedded execution. This also means that you cannot (yet) use spark-submit to run embedded jobs, because that would spawn its own JVMs.

The "smart connector" mode is the normal way in which other Spark connectors work but like all those has the disadvantage of having to pull the required data into the executor JVMs and thus will be slower than embedded mode. For configuring the same, one has to specify "snappydata.connection" property to point to the thrift server running on SnappyData cluster's locator. It is useful for many cases where users want to expand the execution capacity of cluster (e.g. if cluster's embedded execution is saturated all the time on CPU), or for existing Spark distributions/deployments. Needless to say that spark-submit can work in the connector mode just fine. What is "smart" about this mode is: a) if physical nodes hosting the data and running executors are common, then partitions will be routed to those executors as much as possible to minimize network usage, b) will use the optimized SnappyData plans to scan the tables, hash aggregation, hash join.

For this specific question, the answer is: runSnappyJob will receive the SnappySession object as an argument, which should be used rather than creating a new one. The rest of the body that uses the SnappySession will be exactly the same. Likewise, for working with the base SparkContext, it may be easier to implement SparkJob; the code will be similar, except that the SparkContext will be provided as a function argument, which should be used. The reason is as explained above: embedded mode already has a running SparkContext which needs to be used for jobs.
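A minimal sketch of what the job class could look like when rewritten against the SnappyData job API, assuming the SnappySQLJob trait and its runSnappyJob/isValidJob signatures from the SnappyData 1.0.0 docs (verify the exact signatures against your version before building):

```scala
package io.test

import com.typesafe.config.Config
import org.apache.spark.sql.{SnappyJobValid, SnappyJobValidation, SnappySQLJob, SnappySession}

object SnappyTestJob extends SnappySQLJob {

  // The embedded cluster passes in its already-running SnappySession;
  // do NOT build a new SparkSession/SnappySession here.
  override def runSnappyJob(snappy: SnappySession, jobConfig: Config): Any = {
    // Body of the old main() goes here, using the provided session,
    // e.g. snappy.sql("SELECT 1").collect()
    "done"
  }

  // Validate configuration before the job is scheduled.
  override def isValidJob(snappy: SnappySession,
                          config: Config): SnappyJobValidation = SnappyJobValid()
}
```

The jar is then submitted with snappy-job.sh instead of spark-submit, so the code runs on the cluster's pre-existing executors.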

Sumedh

I think the methods isValidJob and runSnappyJob were missing. When I added them to the code it works, but does anyone know what the relation is between the body of the runSnappyJob method and the main method?

Should they be the same in both?

Tomtom