16

My Spark program on EMR keeps failing with this error:

Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
    at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:421)
    at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:128)
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:397)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:573)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)
    at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)
    at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:172)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)

I did some research and found that this certificate check can be disabled in low-security situations by setting the Java system property:

com.amazonaws.sdk.disableCertChecking=true

but I can only set it with spark-submit.sh --conf, which only affects the driver, while most of the errors occur on the workers.
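
Right now the invocation looks roughly like this (a sketch; my_job.py stands in for my actual job), which only sets the property on the driver:

spark-submit \
    --conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.sdk.disableCertChecking=true" \
    my_job.py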

Is there a way to propagate this setting to the workers?

Thanks a lot.

tribbloid

4 Answers

15

Just stumbled upon something in the Spark documentation:

spark.executorEnv.[EnvironmentVariableName]

Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.

So in your case, I'd set the Spark configuration option spark.executorEnv.com.amazonaws.sdk.disableCertChecking to true and see if that helps.
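
For example, via spark-submit (a sketch; your_job.py is a placeholder):

spark-submit \
    --conf spark.executorEnv.com.amazonaws.sdk.disableCertChecking=true \
    your_job.py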

stholzm
5

Adding more to the existing answer.

import pyspark


def get_spark_context(app_name):
    # Configure the application
    conf = pyspark.SparkConf()
    conf.set('spark.app.name', app_name)

    # Set an environment variable for the executors. This must be done
    # *before* the SparkContext is created, otherwise it has no effect.
    conf.set('spark.executorEnv.SOME_ENVIRONMENT_VALUE', 'I_AM_PRESENT')

    # Initialize and return
    sc = pyspark.SparkContext.getOrCreate(conf=conf)
    return pyspark.SQLContext(sparkContext=sc)

The SOME_ENVIRONMENT_VALUE environment variable will then be available on the executors/workers.

In your Spark application, you can read it like this:

import os
some_environment_value = os.environ.get('SOME_ENVIRONMENT_VALUE')
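
For example, putting the two together (a sketch; 'my-app' is a placeholder, and the round-trip check assumes you are on a real cluster, since in local mode the executors share the driver's environment):

import os
import pyspark

sql_context = get_spark_context('my-app')
sc = pyspark.SparkContext.getOrCreate()

# Read the variable on the executors to confirm it was propagated
values = sc.parallelize(range(2), 2) \
    .map(lambda _: os.environ.get('SOME_ENVIRONMENT_VALUE')) \
    .collect()
print(values)  # expected: ['I_AM_PRESENT', 'I_AM_PRESENT']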
yardstick17
1

Building upon the other answers, here is a full example that works (PySpark 2.4.1). In this example I force all workers to spawn only one thread per core in the OpenMP, OpenBLAS, and Intel MKL (Math Kernel Library) threading libraries:

import pyspark

conf = pyspark.conf.SparkConf().setAll([
    ('spark.executorEnv.OMP_NUM_THREADS', '1'),
    ('spark.workerEnv.OMP_NUM_THREADS', '1'),
    ('spark.executorEnv.OPENBLAS_NUM_THREADS', '1'),
    ('spark.workerEnv.OPENBLAS_NUM_THREADS', '1'),
    ('spark.executorEnv.MKL_NUM_THREADS', '1'),
    ('spark.workerEnv.MKL_NUM_THREADS', '1'),
])

spark = pyspark.sql.SparkSession.builder.config(conf=conf).getOrCreate()

# print current PySpark configuration to be sure
print("Current PySpark settings: ", spark.sparkContext._conf.getAll())
Ivan Bilan
1

For Spark 2.4, @Amit Kushwaha's method doesn't work.

I have tested:

1. cluster mode

spark-submit \
    --conf spark.executorEnv.DEBUG=1 \
    --conf spark.appMasterEnv.DEBUG=1 \
    --conf spark.yarn.appMasterEnv.DEBUG=1 \
    --conf spark.yarn.executorEnv.DEBUG=1 \
    main.py

2. client mode

spark-submit \
    --deploy-mode=client \
    --conf spark.executorEnv.DEBUG=1 \
    --conf spark.appMasterEnv.DEBUG=1 \
    --conf spark.yarn.appMasterEnv.DEBUG=1 \
    --conf spark.yarn.executorEnv.DEBUG=1 \
    main.py

None of the above sets an environment variable in the executor process (i.e., nothing that os.environ.get('DEBUG') can read).


The only way that works is to read it from spark.conf:

Submit:

spark-submit --conf DEBUG=1 main.py

Read the variable:

DEBUG = spark.conf.get('DEBUG')
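
Putting the read side together, a minimal sketch (the default value of '0' is an assumption to avoid an exception when the key is missing):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Value passed via: spark-submit --conf DEBUG=1 main.py
debug = spark.conf.get('DEBUG', '0')
print('DEBUG =', debug)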
Mithril