
I'm trying to load a zstd-compressed JSON file (archive size: 16.4 GB) using Spark 3.1.1 with Scala 2.12.10. See the Sample file for reference.

For reference, my PC has 32 GB of RAM. The zstd decompressor comes from the Hadoop native libraries, loaded via

LD_LIBRARY_PATH=/opt/hadoop/lib/native
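
For completeness, the equivalent Spark options would look roughly like this; I only export the environment variable above, so treat this as a sketch:

import org.apache.spark.SparkConf

// Sketch only: these are the standard Spark options for native library paths;
// pointing them at /opt/hadoop/lib/native mirrors the LD_LIBRARY_PATH export above.
val nativeLibConf = new SparkConf()
  .set("spark.driver.extraLibraryPath", "/opt/hadoop/lib/native")
  .set("spark.executor.extraLibraryPath", "/opt/hadoop/lib/native")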

My configuration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, SQLContext}

trait SparkProdContext {

  private val master = "local[*]"
  private val appName = "testing"
  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  val ss: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext

}
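
The trait is mixed into the job roughly like this (the object name is a stand-in); the snippet under "My code" below runs inside it:

// Sketch of how SparkProdContext is mixed in; CommentsJob is a placeholder name.
// The imports, case class and read shown under "My code" live inside this object.
object CommentsJob extends SparkProdContext {
  // ...
}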

My code:

  import org.apache.spark.sql.types.StructType
  import org.apache.spark.sql.catalyst.ScalaReflection
  import ss.implicits._

  case class Comment (
                     author: String,
                     body: String,
                     score: BigInt,
                     subreddit_id: String,
                     subreddit: String,
                     id: String,
                     parent_id: String,
                     link_id: String,
                     retrieved_on: BigInt,
                     created_utc: BigInt,
                     permalink: String
                     )

  val schema = ScalaReflection.schemaFor[Comment].dataType.asInstanceOf[StructType]

  val comments = ss.read.schema(schema).json("/home/user/Downloads/RC_2020-03.zst").as[Comment]
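
The read is lazy, so nothing is decompressed until an action scans the file; count() here is just a stand-in for whatever action forces the scan:

  // The json() read above is lazy; any action that scans the file kicks off
  // the decompression, and with it the error below. count() is a stand-in.
  println(comments.count())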

Upon running the code I'm getting the following error.

22/01/06 23:59:44 INFO CodecPool: Got brand-new decompressor [.zst]
22/01/06 23:59:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: Frame requires too much memory for decoding
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.base/java.io.InputStream.read(InputStream.java:205)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:152)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:192)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:69)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)

Any ideas appreciated!

    Your PC may have 32GB of RAM, but the executor JVM memory size defaults to much smaller than that; have you tried increasing such settings in the Spark config? – OneCricketeer Jan 06 '22 at 22:24
  • Tried tweaking spark.driver.memory and spark.executor.memory to no avail. Then I tried decompressing the zstd file with the unzstd utility and got the same error. Finally I was able to decompress it with `unzstd --long=31 RC_2020-03.zst`; the 16.4 GB archive became a 221 GB file :) – cnstlungu Jan 12 '22 at 20:41
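
To illustrate the workaround from my last comment: decompress the archive outside Spark with a large enough window and point the same reader at the resulting plain JSON. A rough sketch, assuming unzstd writes its output next to the archive:

// Workaround sketch (paths are assumptions):
//   1. Decompress outside Spark:  unzstd --long=31 RC_2020-03.zst
//   2. Read the resulting plain newline-delimited JSON with the same schema.
val decompressedComments = ss.read
  .schema(schema)
  .json("/home/user/Downloads/RC_2020-03")   // default unzstd output name (assumed)
  .as[Comment]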
