
I am studying bioinformatics. Recently I ran the ANNOVAR tool (table_annovar.pl) and got data (CSV format) that includes chr, Func.refGene, Gene.refGene, ExonicFunc.refGene, etc.

So I tried to count the chr values for each distinct value of Func.refGene, Gene.refGene, and ExonicFunc.refGene.

I used a spark-shell DataFrame. I started the shell with: spark-shell --master local[2] --driver-memory 16G --executor-memory 16G --executor-cores 8

My Scala code and the shell session are below:

21/08/06 15:38:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/08/06 15:38:46 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
Spark context Web UI available at http://kbblogin2-ib.kobic.re.kr:4040
Spark context available as 'sc' (master = local[2], app id = local-1628231927063).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information. 

scala> import spark.implicits._ 
scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file.csv")

scala> df.show(5)
+----+------------+------------+------------------+
| chr|Func_refGene|Gene_refGene|ExonicFunc_refGene|
+----+------------+------------+------------------+
|chr1|  intergenic|NONE;DDX11L1|                 .|
|chr1|  intergenic|NONE;DDX11L1|                 .|
|chr1|  intergenic|NONE;DDX11L1|                 .|
|chr1|  intergenic|NONE;DDX11L1|                 .|
|chr1|  intergenic|NONE;DDX11L1|                 .|
+----+------------+------------+------------------+
only showing top 5 rows

scala> val Func_refGene = df.select("Func_refGene").distinct()
scala> Func_refGene.show()
+--------------------+                                                          
|        Func_refGene|
+--------------------+
|                UTR3|
|            upstream|
|              exonic|
|           UTR5;UTR3|
|                UTR5|
|            splicing|
|            intronic|
|        ncRNA_exonic|
|      ncRNA_intronic|
|ncRNA_exonic;spli...|
|     exonic;splicing|
|          ncRNA_UTR5|
|      ncRNA_splicing|
|          downstream|
| upstream;downstream|
|          intergenic|
+--------------------+

scala> for(variant <- Func_refGene){
       |val df_func = df.filter($"Func_refGene" === variant).groupBy("Chr").count()
       |}
[Stage 14:================================================>         (5 + 1) / 6]21/08/06 15:45:32 ERROR Executor: Exception in task 20.0 in stage 15.0 (TID 413)
java.lang.NullPointerException
    at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:32)
    at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:31)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
21/08/06 15:45:32 ERROR Executor: Exception in task 16.0 in stage 15.0 (TID 412)
java.lang.NullPointerException
    ... (identical stack traces for tasks 16.0, 23.0, and 57.0, and the matching TaskSetManager warning, elided) ...
21/08/06 15:45:32 ERROR TaskSetManager: Task 20 in stage 15.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 15.0 failed 1 times, most recent failure: Lost task 20.0 in stage 15.0 (TID 413, localhost, executor driver): java.lang.NullPointerException
    at $anonfun$1.apply(<console>:32)
    at $anonfun$1.apply(<console>:31)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1925)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1913)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1912)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1912)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:948)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2146)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2095)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2084)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:759)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:972)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:970)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
  at org.apache.spark.rdd.RDD.foreach(RDD.scala:970)
  at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3355)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
  at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3351)
  at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2721)
  ... 53 elided
Caused by: java.lang.NullPointerException
  at $anonfun$1.apply(<console>:32)
  at $anonfun$1.apply(<console>:31)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
  at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

An error occurred! I think it's a memory problem, and I'm also not confident in my code. Can anyone help me fix my code or my executor memory options?
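
As the comments below point out, this is likely not a memory problem: for(variant <- Func_refGene) iterates over the DataFrame itself, so the loop body is shipped to executor tasks, where the driver-side df reference is not usable, which is the likely source of the NullPointerException. A minimal sketch of the collect-first fix (the same idea as the answer below, assuming the df and Func_refGene defined above):

// Bring the distinct values back to the driver as plain strings first.
val funcValues = Func_refGene.collect().map(_.getString(0))

// The loop now runs on the driver, and df is only used from driver-side code.
for (variant <- funcValues) {
  df.filter($"Func_refGene" === variant).groupBy("Chr").count().show()
}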

Thanks for the comments! I have added more description to my question. For example, I run this Spark code:

df.filter($"Func_refGene" === "intergenic").groupBy("Chr").count().show(5)
+-----+------+                                                                  
|  Chr| count|
+-----+------+
|chr12|513449|
|chr20|301930|
| chr9|547974|
|chr10|545868|
| chr4|831798|
+-----+------+

"intergenic" is variable in "Func_refGene" columns of dataframe named df. and I want to loop this process. So, I will wish to get "chr | count" result about all variable in "Func_refGene" column.

plus> My Linux server's CPU and memory info (OS: CentOS):

  • grep -c processor /proc/cpuinfo → 24

  • grep "physical id" /proc/cpuinfo | sort -u | wc -l → 2

  • free → total memory = 131414124 KB ≈ 125 GB
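
One note on the launch options above: with --master local[2], Spark runs everything in a single driver JVM, so --executor-memory and --executor-cores should have no effect; only --driver-memory and the thread count in local[N] matter. On a 24-core / 125 GB machine, something like the following would use more of the hardware (the exact numbers are my assumption; tune as needed):

spark-shell --master local[16] --driver-memory 32G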

DongYoon
  • A null pointer exception means that you are performing an operation that needs a `non-null value` on a `null` value. This is definitely an issue in your code. – sarveshseri Aug 06 '21 at 10:08
  • It is a problem with your code: Func_refGene is a DataFrame, and you cannot run a for loop over it this way. There are other ways of doing it, but you would have to update the question with what you are trying to do and what output you are looking for. – Nikunj Kakadiya Aug 06 '21 at 14:11
  • @NikunjKakadiya Thanks for your comment. I want to count the values in the "Chr" column for each distinct value in the "Func_refGene"/"Gene_refGene"/"ExonicFunc_refGene" columns of the DataFrame named df. For example, when I run df.filter($"Func_refGene" === "intergenic").groupBy("Chr").count(), the result is another DataFrame whose header is Chr (key) and count (value) and whose rows look like chr12 | 513449 / chr20 | 301930 – DongYoon Aug 07 '21 at 07:27
  • @sarveshseri I'll keep your comment in mind. Thank you! – DongYoon Aug 07 '21 at 07:37

1 Answer


The answer to my own question :)

// Collect the distinct values to the driver as plain strings.
val Func_array = df.select($"Func_refGene").distinct().collect().map(_.getString(0))

for (func <- Func_array) {
    println(s"Func_refGene = $func")
    // show() prints the per-chromosome counts; println on a DataFrame would only print its schema
    df.filter($"Func_refGene" === func).groupBy("Chr").count().show()
}

The "code" above is the "code" I wanted!

DongYoon