
I am using the following code to create a DataFrame from an RDD. I am able to perform operations on the RDD, and the RDD is not empty.

I tried the following two approaches; with both I get the same exception.

Approach 1: Build the dataset using sparkSession.createDataFrame().

System.out.println("RDD Count: " + rdd.count());
        Dataset<Row> rows = applicationSession
                .getSparkSession().createDataFrame(rdd,  data.getSchema()).toDF(data.convertListToSeq(data.getColumnNames()));
        rows.createOrReplaceTempView(createStagingTableName(sparkTableName));
        rows.show();
        rows.printSchema();
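For reference, a minimal self-contained sketch of what this approach boils down to (the schema, column names, and view name below are placeholders, not my actual values):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Placeholder schema; in my code data.getSchema() returns the StructType.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("column1", DataTypes.StringType, true),
        DataTypes.createStructField("column2", DataTypes.StringType, true)));

// rdd is a JavaRDD<Row>; each Row must match the schema in arity and types.
Dataset<Row> rows = sparkSession.createDataFrame(rdd, schema);
rows.createOrReplaceTempView("staging_table");  // placeholder view name
rows.show();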

Approach 2: Use HiveContext to create the dataset.

System.out.println("RDD Count: " + rdd.count());
    System.out.println("Create view using HiveContext..");
    Dataset<Row> rows = applicationSession.gethiveContext().applySchema(rdd, data.getSchema());

I am able to print the schema for the above dataset using both approaches, so I am not sure what exactly is causing the NullPointerException.

The show() method internally invokes take(), which is throwing the NullPointerException. But why is this dataset coming up null? If the RDD contains values, it should not be null.

This is strange behaviour.
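If I understand Spark's evaluation model correctly, createDataFrame only wires up the Row conversion lazily; each Row is converted to Spark's internal format on the executors when an action like show() or count() runs, which would explain why the failure only surfaces at that point. A quick diagnostic sketch I can run (assuming rdd is a JavaRDD<Row> and data.getSchema() returns the StructType):

// Count null rows; a null Row would throw an NPE during the
// Row -> InternalRow conversion that show()/take() triggers.
long nullRows = rdd.filter(r -> r == null).count();
System.out.println("Null rows in RDD: " + nullRows);

// Rows whose arity differs from the schema can also fail the conversion.
int expectedWidth = data.getSchema().fields().length;
long badWidth = rdd.filter(r -> r != null && r.size() != expectedWidth).count();
System.out.println("Rows with wrong arity: " + badWidth);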

Below are the relevant logs:

RDD Count: 35

I am also able to run the above code in local mode without any exception; it works fine.

As soon as I deploy this code on YARN, I start getting the following exception.

I am able to create the DataFrame, and I am even able to register a view for it. But as soon as I perform a rows.show() or rows.count() operation on this dataset, I get the following error.

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
    at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Am I doing anything wrong here? Please suggest.


1 Answer


Can you post the schema for the DataFrame? The issue is likely with the schema string you are using and the separator you use to split that schema string.
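If you are building the StructType by splitting a delimited schema string, a wrong separator can leave you with malformed field names that only blow up when the rows are actually converted. A safer alternative is to construct the schema programmatically; a minimal sketch (the column names here are just examples):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build the schema field by field so no string parsing is involved.
List<StructField> fields = new ArrayList<>();
for (String name : new String[]{"column1", "column2", "column3"}) {
    fields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);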

– Mugdha
  • Reference data schema : test_table :
    StructType(StructField(test_table.column1,StringType,false), StructField(test_table.column2,StringType,false), StructField(test_table.column3,StringType,false), StructField(test_table.column4,StringType,false), StructField(test_table.column5,StringType,false))
    Reference data size : test_table : 35
    root
     |-- column1: string (nullable = true)
     |-- column2: string (nullable = true)
     |-- column3: string (nullable = true)
     |-- column4: string (nullable = true)
     |-- column5: string (nullable = true)
    – Chetan Shirke May 18 '18 at 07:01
  • It is working fine in local mode. Not sure why exactly it is returning an empty DataFrame on YARN. – Chetan Shirke May 18 '18 at 07:03