I'm using one of the Docker images of EMR on EKS (emr-6.5.0:20211119) and investigating how to work on Kafka with Spark Structured Streaming (pyspark). As per the integration guide, I run a Python script as follows.

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  <myscript>.py
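
For reference, the contents of `<myscript>.py` aren't shown here; a minimal Structured Streaming job of the kind being tested might look like the sketch below, where the bootstrap server address, topic name and checkpoint path are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-test").getOrCreate()

# Subscribe to a Kafka topic (bootstrap server and topic are placeholders).
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "my-topic")
    .load()
)

# Print the key/value pairs to the console as a quick check.
query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()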

It downloads the package from Maven Central, and I see some JAR files are downloaded into ~/.ivy2/jars.

com.github.luben_zstd-jni-1.4.8-1.jar       org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.2.jar             org.slf4j_slf4j-api-1.7.30.jar
org.apache.commons_commons-pool2-2.6.2.jar  org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.2.jar  org.spark-project.spark_unused-1.0.0.jar
org.apache.kafka_kafka-clients-2.6.0.jar    org.lz4_lz4-java-1.7.1.jar                                       org.xerial.snappy_snappy-java-1.1.8.2.jar

However, I find the main JAR file is already downloaded into $SPARK_HOME/external/lib, and I wonder how to make use of it instead of downloading it.

spark-avro_2.12-3.1.2-amzn-1.jar          spark-ganglia-lgpl.jar                      spark-streaming-kafka-0-10-assembly_2.12-3.1.2-amzn-1.jar   spark-streaming-kinesis-asl-assembly.jar
spark-avro.jar                            spark-sql-kafka-0-10_2.12-3.1.2-amzn-1.jar  spark-streaming-kafka-0-10-assembly.jar                     spark-token-provider-kafka-0-10_2.12-3.1.2-amzn-1.jar
spark-ganglia-lgpl_2.12-3.1.2-amzn-1.jar  spark-sql-kafka-0-10.jar                    spark-streaming-kinesis-asl-assembly_2.12-3.1.2-amzn-1.jar  spark-token-provider-kafka-0-10.jar

UPDATE 2022-03-09

I tried again after updating spark-defaults.conf as shown below, adding the external lib folder to the class path.

spark.driver.extraClassPath      /usr/lib/spark/external/lib/*:...
spark.driver.extraLibraryPath    ...
spark.executor.extraClassPath    /usr/lib/spark/external/lib/*:...
spark.executor.extraLibraryPath  ...

The app can then be submitted without --packages, but it fails with the following error.

22/03/09 05:37:25 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:623)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:52)
        at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:40)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:60)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 33 more

Adding --packages org.apache.commons:commons-pool2:2.6.2 doesn't help either.

2 Answers

You would use --jars to refer to JARs on the local filesystem in place of --packages.
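
For example, something along these lines (a sketch only; the exact set of JARs depends on the distribution, and the kafka-clients and commons-pool2 paths are placeholders since those JARs are not in external/lib):

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  --jars "$SPARK_HOME/external/lib/spark-sql-kafka-0-10.jar,$SPARK_HOME/external/lib/spark-token-provider-kafka-0-10.jar,/path/to/kafka-clients.jar,/path/to/commons-pool2.jar" \
  <myscript>.py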

  • Simply adding `--jars $SPARK_HOME/external/lib/spark-sql-kafka-0-10.jar` doesn't seem to work as it requires multiple JAR files. Adding every JAR file doesn't seem to be practical. – Jaehyeon Kim Mar 08 '22 at 01:43
  • Well, this is the answer you're looking for. Nothing in the external folder is used, by default. Besides, you only need kafka-clients JAR here, and you'd still need to include that if you moved the sql-kafka jar into Spark's main lib folder on all executors – OneCricketeer Mar 08 '22 at 13:13
  • No, it's not the answer I'm looking for. What I need is how to work on Spark structured streaming without downloading an external package. If you think your answer is sufficient, please give me full details. – Jaehyeon Kim Mar 09 '22 at 00:35
  • @Jae That's not possible. Spark by itself doesn't include Kafka libraries. They'd have to be downloaded at some point. If they are available on disk by EMR, then fine, but you still need to modify the execution command to reference those files – OneCricketeer Mar 09 '22 at 15:14
  • See my update. There should be a reason that the JAR files are downloaded already. I'm looking for how to make use of those. For example, if I want to use Hudi, I can use an existing one. – Jaehyeon Kim Mar 09 '22 at 22:41

Unfortunately, I cannot submit an app with only the JAR files in $SPARK_HOME/external/lib due to an error; the details are added to the question as an update. Instead, I ended up pre-downloading the package JAR files and using those.

I first ran the following command. Here foo.py is an empty file, and the run downloads the package JAR files into /home/hadoop/.ivy2/jars.

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  foo.py

Then I updated spark-defaults.conf as follows.

spark.driver.extraClassPath      /home/hadoop/.ivy2/jars/*:...
spark.driver.extraLibraryPath    ...
spark.executor.extraClassPath    /home/hadoop/.ivy2/jars/*:...
spark.executor.extraLibraryPath  ...

After that, I ran the submit command without --packages and it worked without an error.

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  <myscript>.py

This approach is likely to be useful when downloading the package JAR files takes a long time, as they can be pre-downloaded. Note that EMR on EKS supports using a custom image from ECR, as sketched below.
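
A rough Dockerfile for such a custom image might look like the following sketch; the base image URI (account ID and region) is a placeholder to be taken from the EMR on EKS docs, and the local jars/ directory is assumed to hold the pre-downloaded JAR files.

# Base image per the EMR on EKS docs; account ID and region are placeholders.
FROM <aws-account-id>.dkr.ecr.<region>.amazonaws.com/spark/emr-6.5.0:20211119

USER root
# Bake the pre-downloaded package JARs into the image.
COPY jars/ /home/hadoop/.ivy2/jars/
# Include the spark-defaults.conf with the extraClassPath entries shown above.
COPY spark-defaults.conf /usr/lib/spark/conf/spark-defaults.conf
# Switch back to the hadoop user, as in the documentation example.
USER hadoop:hadoop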

  • Ideally, you wouldn't use `--master local` in EMR. That being said, all the executors need to download the same files unless you use something like rsync or Ansible to distribute them internally to your VPC (or package in a container, like you say, but you still need the Docker daemon to download those images) – OneCricketeer Mar 10 '22 at 13:59
  • This is for local development. For production, it'll be deployed to EKS. EMR on EKS supports a custom image from ECR (https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images.html). I can build an image with those JAR files included. It would be beneficial when downloading all the JAR files takes a long time. – Jaehyeon Kim Mar 10 '22 at 21:48