
Hi, I am facing an error when providing dependency jars to spark-submit on Kubernetes.

/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit \
  --master k8s://https://112.23.123.23:6443 \
  --deploy-mode cluster \
  --name spark-postgres-minio-kubernetes \
  --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 \
  file:///AirflowData/kubernetes/python/postgresminioKube.py

Below is the code to execute. The jars needed for S3/MinIO and the related configuration are placed in $SPARK_HOME/conf and $SPARK_HOME/jars, and the Docker image is built with them (a rough sketch of that Dockerfile is included after the code).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import json

# spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()
spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()

# JDBC connection details (placeholders)
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hostname", "port", "db")
connectionProperties = {
    "user": "username",
    "password": "password",
    "driver": "org.postgresql.Driver",
    "fetchsize": "100000"
}

# Push the query down to Postgres and read it partitioned on employee_id
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id",
                     lowerBound=1, upperBound=100, numPartitions=2,
                     properties=connectionProperties)

# Write the result to MinIO over s3a, as CSV and as Parquet
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
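
The image referenced above (hostname:5000/spark-py:spark3.1.2) is built roughly along these lines. This is only a sketch: the base image tag, the jar file names/versions and the local paths are assumptions, not the actual Dockerfile.

# Sketch only: base tag, jar versions and paths are assumed, not the real ones
FROM spark-py:3.1.2

# S3A / MinIO dependencies baked into $SPARK_HOME/jars
COPY externalJars/hadoop-aws-3.2.0.jar /opt/spark/jars/
COPY externalJars/aws-java-sdk-bundle-1.11.375.jar /opt/spark/jars/

# s3a endpoint and credentials for MinIO in $SPARK_HOME/conf
COPY conf/spark-defaults.conf /opt/spark/conf/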

The error is below. It is trying to fetch the jar on the executor for some reason.

21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573

21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
  • Is there a stack trace belonging to the ERROR line? – Bernhard Stadler Nov 16 '21 at 19:53
  • Is the path `/AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar` on the machine from which you submit or inside that `hostname:5000/spark-py:spark3.1.2` container image or somewhere else? – Bernhard Stadler Nov 16 '21 at 20:07
  • ... either way, please have a close look at the [Dependency Management section](https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management) of the Spark on Kubernetes documentation. – Bernhard Stadler Nov 16 '21 at 20:09
  • The path is mounted on all nodes. But it works only if the jars are built into the image. – Rafa Nov 22 '21 at 13:12
  • Were you still using the command you posted originally? Changing it according to the doc section I referenced might help. – Bernhard Stadler Dec 14 '21 at 22:16

1 Answer


The external jars are getting added to /opt/spark/work-dir, and the Spark user did not have access to that directory. So I changed the Dockerfile to grant access to the folder, and then it worked.

RUN chmod 777 /opt/spark/work-dir
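
In a derived image that change sits roughly like this; the base tag and the UID 185 used by the stock Spark images are illustrative assumptions, not necessarily the exact values here.

FROM spark-py:3.1.2

# switch to root to change permissions, then back to the non-root Spark user
USER root
RUN chmod 777 /opt/spark/work-dir
USER 185

A chown to the Spark UID would be a less permissive alternative to chmod 777.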