
I'm working with GraphX and Spark SQL, and I'm trying to create a DataFrame (Dataset) inside a graph node. To create a DataFrame I need the SparkSession (spark.createDataFrame(rows, schema)). Whatever I try, I get an error. This is my code:

SparkSession spark = SparkSession.builder()
            .master("spark://home:7077")
            .appName("testgraph")
            .getOrCreate();

JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

//read tree File
JavaRDD<String> tree_file = sc.textFile(args[1]);

JavaPairRDD<String[], Long> node_pair = tree_file.map(l -> l.split(" ")).zipWithIndex();

//Create vertices
RDD<Tuple2<Object, Dataset<Row>>> vertices = node_pair.map(t -> {

    List<StructField> fields = new ArrayList<StructField>();
    List<Row> rows = new ArrayList<>();
    String[] vars = Arrays.copyOfRange(t._1(), 2, t._1().length);

    for (int i = 0; i < vars.length; i++) {
        fields.add(DataTypes.createStructField(vars[i], DataTypes.BooleanType, true));
    }
    StructType schema = DataTypes.createStructType(fields);

    Dataset<Row> ds = spark.createDataFrame(rows, schema);
    return new Tuple2<>((Object) (t._2 + 1), ds);

}).rdd();
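From what I understand, the lambda passed to map() is serialized and executed on the executors, where the driver-side session state doesn't exist. Here is a plain-Java sketch of that effect (this is my own illustration, not Spark code; the Holder class is hypothetical): a transient field survives on the original object but is null after a serialization round-trip, which seems analogous to the NullPointerException in sessionState on the executor:

```java
import java.io.*;

public class ClosureSketch {
    // Stands in for an object whose internal state lives only on the driver.
    static class Holder implements Serializable {
        // transient state is dropped during serialization, much like the
        // session state when a closure is shipped to an executor
        transient Object state = new Object();
    }

    // Serialize and deserialize, mimicking a closure shipped to an executor.
    static Holder roundTrip(Holder h) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(h);
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        return (Holder) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        Holder onDriver = new Holder();
        Holder onExecutor = roundTrip(onDriver);
        // state is intact on the "driver" copy, null on the "executor" copy
        System.out.println(onDriver.state != null);   // true
        System.out.println(onExecutor.state == null); // true
    }
}
```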

This is the error I'm getting:

16/08/23 15:25:36 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, 192.168.1.5): java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:328)
at Main.lambda$main$e7daa47c$1(Main.java:62)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1028)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I also tried to get the session inside map() with:

SparkSession ss = SparkSession.builder()
                .master("spark://home:7077")
                .appName("testgraph")
                .getOrCreate();

That also produces an error:

16/08/23 15:00:29 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 7, 192.168.1.5): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I hope someone can help me. I can't find a solution. Thanks!

Vitali D.
    There is nothing unexpected here. Spark doesn't support operations like this. – zero323 Aug 23 '16 at 13:34
  • I forgot to mention that I'm using Spark 2.0.0. Is there no solution for this problem? – Vitali D. Aug 23 '16 at 14:03
  • @zero323 is right! This operation will never work. You can't create a DataFrame or an RDD inside of a map method. – Thiago Baldim Aug 23 '16 at 16:56
  • Okay, thanks for your answers. Is there a table-like data structure I can use? – Vitali D. Aug 23 '16 at 18:30
  • Could you clarify why you're trying to build a DataFrame within the context of a graph? Could you utilize GraphFrames (https://github.com/graphframes/graphframes) to solve this problem? – Denny Lee Aug 24 '16 at 04:36
  • On every graph node I need a particular table, and it would be nice if I could parallelize the table, or rather the rows in it. I also want to run Pregel on that graph. A Java framework or data structure would be nice. Or I could write my own table structure with Lists. – Vitali D. Aug 24 '16 at 12:10
  • Is it possible to use the Scala lib Saddle in GraphX? – Vitali D. Sep 04 '16 at 23:57
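Following up on the idea from the comments (writing my own table structure with Lists), here is a minimal sketch of what I have in mind. The class name BoolTable and its methods are my own invention, not a library API. Since it is a plain serializable Java object, it should be usable as a GraphX vertex attribute and shippable between driver and executors:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A tiny table of boolean columns, meant to live inside a GraphX vertex
// attribute instead of a per-node DataFrame. Plain data plus Serializable,
// so Spark can move it between driver and executors.
public class BoolTable implements Serializable {
    private final List<String> columns;
    private final List<boolean[]> rows = new ArrayList<>();

    public BoolTable(List<String> columns) {
        this.columns = new ArrayList<>(columns);
    }

    public void addRow(boolean... values) {
        if (values.length != columns.size()) {
            throw new IllegalArgumentException("expected " + columns.size() + " values");
        }
        rows.add(values.clone());
    }

    public List<String> columns() { return columns; }
    public int rowCount() { return rows.size(); }

    public boolean get(int row, String column) {
        return rows.get(row)[columns.indexOf(column)];
    }

    public static void main(String[] args) {
        BoolTable t = new BoolTable(Arrays.asList("x1", "x2"));
        t.addRow(true, false);
        t.addRow(false, true);
        System.out.println(t.rowCount());   // 2
        System.out.println(t.get(0, "x1")); // true
    }
}
```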

0 Answers