
I am trying to run a Spark job on a Google Dataproc cluster, but I get the following error:

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMapping not org.apache.hadoop.security.GroupMappingServiceProvider
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2330)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:108)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:102)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:450)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:310)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2430)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at com.my.package.spark.SparkModule.provideJavaSparkContext(SparkModule.java:59)
    at com.my.package.spark.SparkModule$$ModuleAdapter$ProvideJavaSparkContextProvidesAdapter.get(SparkModule$$ModuleAdapter.java:140)
    at com.my.package.spark.SparkModule$$ModuleAdapter$ProvideJavaSparkContextProvidesAdapter.get(SparkModule$$ModuleAdapter.java:101)
    at dagger.internal.Linker$SingletonBinding.get(Linker.java:364)
    at spark.Main$$InjectAdapter.get(Main$$InjectAdapter.java:65)
    at spark.Main$$InjectAdapter.get(Main$$InjectAdapter.java:23)
    at dagger.ObjectGraph$DaggerObjectGraph.get(ObjectGraph.java:272)
    at spark.Main.main(Main.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMapping not org.apache.hadoop.security.GroupMappingServiceProvider
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2324)
    ... 31 more

Dataproc versions: 1.1.51 and 1.2.15

Job configuration:

Region: global
Cluster: my-cluster
Job type: Spark
Jar files: gs://bucket/jars/spark-job.jar
Main class or jar: spark.Main
Arguments:
Properties:
spark.driver.extraClassPath: /path/to/google-api-client-1.20.0.jar
spark.driver.userClassPathFirst: true

I have no problem running it this way on the command line:

spark-submit --conf "spark.driver.extraClassPath=/path/to/google-api-client-1.20.0.jar" --conf "spark.driver.userClassPathFirst=true" --class spark.Main /path/to/spark-job.jar

But the UI/API does not allow you to pass both the class name and the jar the way spark-submit does, so the generated command looks like this instead:

spark-submit --conf spark.driver.extraClassPath=/path/to/google-api-client-1.20.0.jar --conf spark.driver.userClassPathFirst=true --class spark.Main --jars /tmp/1f4d5289-37af-4311-9ccc-5eee34acaf62/spark-job.jar /usr/lib/hadoop/hadoop-common.jar

I can't figure out if it is a problem with providing the extraClassPath or if the spark-job.jar and the hadoop-common.jar are somehow conflicting.

MRR

3 Answers


For anyone who encounters this issue (even if you aren't using Spark):

This exception is raised from Hadoop's Configuration.getClass, here: Configuration.java#L2585


Meaning of the Exception

Hadoop is telling you that the first class named in the message is not a subclass or implementation of the second class (the required interface).

This is what Class.isAssignableFrom(OtherClass) checks on the line just before the exception is thrown (line 2584):

The java.lang.Class.isAssignableFrom() determines if the class or interface represented by this Class object is either the same as, or is a superclass or superinterface of, the class or interface represented by the specified Class parameter.
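For concreteness, here is a rough, hedged paraphrase of that check in plain Java (a sketch, not the verbatim Hadoop source; the class and interface names are the ones from the stack trace):

    import org.apache.hadoop.security.GroupMappingServiceProvider;

    // Sketch of the shape of the check in Configuration.getClass: the configured
    // class must be assignable to the expected interface, otherwise Hadoop throws
    // the "<class> not <interface>" RuntimeException seen in the question.
    public class GetClassCheckSketch {

        static Class<?> getClassChecked(String name, Class<?> xface) throws ClassNotFoundException {
            // initialize = false so static initializers (e.g. native-library checks) don't run
            Class<?> theClass = Class.forName(name, false, GetClassCheckSketch.class.getClassLoader());
            if (!xface.isAssignableFrom(theClass)) {
                throw new RuntimeException("class " + theClass.getName() + " not " + xface.getName());
            }
            return theClass;
        }

        public static void main(String[] args) throws Exception {
            // Passes when one consistent hadoop-common provides both types; fails with the
            // exception from the question when they come from mismatched jars or class loaders.
            getClassChecked("org.apache.hadoop.security.JniBasedUnixGroupsMapping",
                            GroupMappingServiceProvider.class);
            System.out.println("check passed");
        }
    }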

But we know that

  • org.apache.hadoop.security.GroupMappingServiceProvider

is a superinterface of (i.e., is implemented by)

  • org.apache.hadoop.security.JniBasedUnixGroupsMapping

So how can this happen?

Why this happens

This can happen for several reasons (a quick classpath check is sketched after this list):

  1. You have two versions of the hadoop-client libraries on your classpath.
  2. You've bundled Hadoop inside another library, and it is a different version or release than another bundled Hadoop library.
  3. Your classpath points to multiple Hadoop installations or Hadoop clients when it should contain only a single hadoop-client jar.
  4. You are using the wrong Hadoop package, for instance targeting a Cloudera environment while using the open-source Hadoop libraries.
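A quick way to check for reasons 1-3 is to ask the JVM which classpath entries actually contain the interface from the error message. This diagnostic is a sketch added here for illustration, not part of the original answer:

    import java.net.URL;
    import java.util.Enumeration;

    // Diagnostic sketch: print every classpath entry that contains the interface
    // from the error. More than one line of output usually means duplicate or
    // mismatched Hadoop jars are visible to the class loader.
    public class FindHadoopCopies {
        public static void main(String[] args) throws Exception {
            String resource = "org/apache/hadoop/security/GroupMappingServiceProvider.class";
            Enumeration<URL> copies = FindHadoopCopies.class.getClassLoader().getResources(resource);
            while (copies.hasMoreElements()) {
                System.out.println(copies.nextElement());
            }
        }
    }

Run it with the same classpath your job uses; two or more entries point at reasons 1 or 3 above.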

Solutions

  • Align all your Hadoop libraries on a single, identical version.
  • Create a meta package and depend only on that single meta-package jar across all your plugins, clients, and packages.
  • Switch to the client-library release that matches the environment you target (see reason 4 above).
  • Use the shaded Hadoop client libraries, hadoop-client-api and hadoop-client-runtime.
Ben DeMott

I think this is caused by the combination of userClassPathFirst and /usr/lib/hadoop/hadoop-common.jar being the jar Dataproc specifies to spark-submit. In some cases, the instance of GroupMappingServiceProvider from the user class loader will be used and in others the instance from the system class loader will be used. As a class loaded from one class loader is not equal to the same class loaded from another class loader, you would end up with this exception.
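To make that concrete, here is a small self-contained sketch (not from the original answer; the jar path is a placeholder) showing that the "same" class defined by two different class loaders is not assignable across them, which is exactly the isAssignableFrom failure in the stack trace:

    import java.net.URL;
    import java.net.URLClassLoader;

    // Sketch of the class-loader mismatch described above. The jar path below is a
    // placeholder; any jar that contains the two Hadoop classes will do.
    public class ClassLoaderMismatch {
        public static void main(String[] args) throws Exception {
            URL[] jar = { new URL("file:///usr/lib/hadoop/hadoop-common.jar") }; // placeholder path

            // parent = null disables delegation, so each loader defines its own copy of the classes
            try (URLClassLoader loaderA = new URLClassLoader(jar, null);
                 URLClassLoader loaderB = new URLClassLoader(jar, null)) {

                Class<?> iface = loaderA.loadClass("org.apache.hadoop.security.GroupMappingServiceProvider");
                Class<?> impl  = loaderB.loadClass("org.apache.hadoop.security.JniBasedUnixGroupsMapping");

                // Prints false: impl does implement GroupMappingServiceProvider, but the
                // interface it implements was defined by loaderB, not by loaderA.
                System.out.println(iface.isAssignableFrom(impl));
            }
        }
    }

With userClassPathFirst, the user jar's class loader and the system class loader can play the roles of loaderA and loaderB for the Hadoop classes, which is how the exception can occur even though only one version of the class exists.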

Instead of userClassPathFirst, would it make sense to relocate the conflicting classes using something like Maven Shade?

Angus Davis
  • Thanks. I am already using Maven, so I added the relocation described [here](https://stackoverflow.com/questions/33922719/running-app-jar-file-on-spark-submit-in-a-google-dataproc-cluster-instance/33925408). Now I can get the job started without extraClassPath, but I get a different Hadoop error: `java.util.ServiceConfigurationError: org.apache.hadoop.io.compress.CompressionCodec: Provider org.apache.hadoop.io.compress.Lz4Codec not a subtype`. Relocating org.apache.hadoop.io.compress did not help either. – MRR Dec 28 '17 at 20:11
  • Is this with userClassPathFirst or without? Once api-client is relocated in your jar, I would expect you to be able to execute your job without using userClassPathFirst. – Angus Davis Dec 28 '17 at 20:14
  • It happens also without userClassPathFirst: `spark-submit --conf spark.driver.userClassPathFirst=false --class spark.Main --jars /path/to/spark-job-SHADED.jar /usr/lib/hadoop/hadoop-common.jar` – MRR Dec 28 '17 at 20:16
  • This runs fine without the hadoop-common.jar: `spark-submit --conf spark.driver.userClassPathFirst=false --class spark.Main /path/to/spark-job-SHADED.jar` – MRR Dec 28 '17 at 20:19
  • I had forgotten to also set `spark.executor.userClassPathFirst=false`; with that set as well, no more Hadoop errors. – MRR Jan 19 '18 at 01:24

If you don't want to turn off spark.driver.userClassPathFirst=true, you could check that the "org.apache.spark" %% "spark-core" % SPARK_VERSION dependency is present and that its scope is defined correctly. When the spark-core jar is on the classpath, this exception won't be thrown.

Eugene Lopatkin