
I have a program that converts an Excel file to a Spark DataFrame and then writes it to our data lake in compressed ORC format. Note that I am constrained to the Spark 1.6.2 API.

  • Variable sq is a HiveContext.
  • Variable schema contains a Spark StructType of small size (about 25 KB).
  • Variable excelData contains a Java List of Spark Row holding a few MB of data.
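
For reference, here is a minimal sketch of how these variables might be wired together in Spark 1.6 (the column names and the excel-parsing step are placeholder assumptions; only the types follow the description above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sc = new SparkContext(new SparkConf().setAppName("ExcelToOrc"))
val sq = new HiveContext(sc)

// ~25 KB StructType describing the excel columns (placeholder fields)
val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType)
))

// a few MB of rows parsed from the excel file (placeholder rows)
val excelData: java.util.List[Row] = java.util.Arrays.asList(
  Row("a", "1"),
  Row("b", "2")
)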

Here is the code:

val df = sq.createDataFrame(excelData, schema)

log.info("Writing Spark DataFrame as ORC file...")
df.write.mode(SaveMode.Overwrite).option("compression", "snappy").orc("myfile.orc")

Here are my Yarn Logs:

17/06/16 17:03:13 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at scala.StringContext.standardInterpolator(StringContext.scala:123)
    at scala.StringContext.s(StringContext.scala:90)
    at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:346)
    at preprocess.Run$.main(Run.scala:109)
    at preprocess.Run.main(Run.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)

What is happening here? I suspect the serialized task is too large.

sweeeeeet
  • Why do you broadcast your excelData? – Tom Lous Jun 16 '17 at 10:12
  • Well, I broadcast it because it seems too big to send to every worker with each task. Am I wrong? – sweeeeeet Jun 16 '17 at 10:32
  • Actually, broadcasting will transmit the entire dataset to each worker. That's why you should only use it if absolutely necessary. Otherwise just let Spark decide how to split the work across the cluster. On the other hand, you just seem to be transforming the excel into an ORC file. Do you even need Spark for that? – Tom Lous Jun 16 '17 at 10:51
  • I have to use Spark, it's the only tool allowed. – sweeeeeet Jun 16 '17 at 10:57
  • @sweeeeeet your question is still considered unanswered as long as you haven't accepted an answer. Deleting the answer might reduce its visibility for other users. – eliasah Jun 16 '17 at 13:32
  • That said, I still don't see the constraint on not using broadcast variables. Please do explain! – eliasah Jun 16 '17 at 13:33
  • Is there even a real problem? The last line mentions a successful write. – Tom Lous Jun 16 '17 at 13:34
  • Some info about broadcasts: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html – Tom Lous Jun 16 '17 at 13:35
  • The thing is, the way I'm doing this is really far from best practices. That's a warning sign for me. – sweeeeeet Jun 16 '17 at 14:09
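
To make the broadcast discussion in the comments above concrete, this is roughly what broadcasting excelData would look like in Spark 1.6 (a sketch only; sc is the SparkContext, and none of this appears in the posted code). A broadcast still ships the full list to every executor; it just avoids re-sending it inside every task closure:

// register the list once as a read-only broadcast variable
val excelBroadcast = sc.broadcast(excelData)

// tasks access the shared copy through .value instead of capturing
// excelData in their closures
val sizes = sc.parallelize(1 to 4).map(_ => excelBroadcast.value.size()).collect()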

0 Answers