AWS glue pyspark: java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

Question

I'm trying to read a CSV file from S3 in my AWS Glue PySpark script. Here is the relevant snippet:

import sys
import os
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

argList = ['config']
args = getResolvedOptions(sys.argv,argList)

print(f"The config path is: {args['config']}")

sc = SparkContext.getOrCreate()
sch = sc._jsc.hadoopConfiguration()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

sch.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sch.set("fs.s3.canned.acl","BucketOwnerFullControl")

source_path_url = "s3://bucket/folder"
df = spark.read.option("header", "true").option("inferSchema", "true").csv(source_path_url)

When I run it, I get the following error:

: java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:343)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:333)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:615)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 24 more

Do I need to provide the jets3t jars to Glue myself? If so, why? In Scala Spark job runtimes, Glue provides these jars automatically.

score 1 · Accepted Answer · answered Sep 20 '21 at 06:30

I found the solution. As I suspected in the original post, the jets3t jar has to be supplied explicitly: the stack trace shows org.apache.hadoop.fs.s3native.NativeS3FileSystem failing to load org.jets3t.service.ServiceException, which means the jar is not on the classpath of the PySpark job. Download the jets3t jar and upload it to an S3 location. Then add a job parameter to the Glue job with key "--extra-jars" and value "s3_path_to_jets3t_jar".

Alternatively, set the jar's S3 path in the "Dependent jars path" field of the Glue job.
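If the job is configured from code rather than the console, the same parameter can be built as a plain dictionary. The following is only a minimal sketch: the helper name `with_extra_jars`, the `--config` value, and the jar path are hypothetical placeholders, not part of the original setup. It relies on `--extra-jars` accepting a comma-separated list of full S3 paths.

```python
def with_extra_jars(job_arguments, jar_s3_paths):
    """Return a copy of the Glue job arguments with the given jars
    appended to the comma-separated --extra-jars value."""
    updated = dict(job_arguments)
    # Keep any jars that were already configured for the job.
    existing = [p for p in updated.get("--extra-jars", "").split(",") if p]
    updated["--extra-jars"] = ",".join(existing + list(jar_s3_paths))
    return updated


# Hypothetical values: replace with your own config and jar locations.
job_args = with_extra_jars(
    {"--config": "s3://bucket/folder/config.json"},
    ["s3://bucket/jars/jets3t.jar"],
)
print(job_args)
```

A dictionary like this could then be supplied as the job's arguments, for example through the `Arguments` parameter of boto3's Glue `start_job_run` call, assuming boto3 and suitable credentials are available.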

Harsh P Waghela
