I am working with the DeepLearning4j library, version 1.0.0-M1.1. Everything runs on an HPC cluster: I build a jar file and submit it with spark-submit. The job ran fine on the CPU, but when I switched to the GPU I got this error:
Warning: Versions of org.bytedeco:javacpp:1.5.4 and org.bytedeco:openblas:0.3.13-1.5.5 do not match.
Warning: Versions of org.bytedeco:javacpp:1.5.4 and org.bytedeco:opencv:4.5.1-1.5.5 do not match.
22/08/03 21:05:26 INFO BaseImageRecordReader: ImageRecordReader: 1000 label classes inferred using label generator ParentPathLabelGenerator
iterator
data list creator
java.lang.RuntimeException: No CUDA devices were found in system
at org.nd4j.linalg.jcublas.JCublasBackend.canRun(JCublasBackend.java:69)
at org.nd4j.linalg.jcublas.JCublasBackend.isAvailable(JCublasBackend.java:52)
at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:160)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5092)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/08/03 21:05:26 WARN Nd4jBackend: Skipped [JCublasBackend] backend (unavailable): java.lang.RuntimeException: No CUDA devices were found in system
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5095)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
... 25 more
Caused by: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:196)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5092)
... 26 more
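Since the exception is thrown in the main thread started by spark-submit, one thing I can check is whether that node sees a GPU at all. Below is a minimal sketch of such a check, not part of my project: the class name `GpuVisibilityCheck` and its methods are made up for illustration, and it assumes `nvidia-smi` is on the PATH and that `nvidia-smi -L` prints one `GPU <index>: ...` line per device.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (illustration only): reports how many GPUs nvidia-smi lists on this node.
public class GpuVisibilityCheck {

    // Counts lines of the form "GPU 0: <name> (UUID: ...)", as printed by `nvidia-smi -L`.
    public static int countGpuLines(String nvidiaSmiOutput) {
        int count = 0;
        for (String line : nvidiaSmiOutput.split("\\R")) {
            if (line.trim().matches("GPU \\d+: .*")) {
                count++;
            }
        }
        return count;
    }

    // Runs `nvidia-smi -L` and returns its combined output,
    // or an empty string if the tool cannot be started.
    public static String runNvidiaSmi() {
        try {
            Process p = new ProcessBuilder("nvidia-smi", "-L")
                    .redirectErrorStream(true)
                    .start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            p.waitFor();
            return out.toString();
        } catch (IOException e) {
            return ""; // nvidia-smi not installed or not on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "";
        }
    }

    public static void main(String[] args) {
        // CUDA_VISIBLE_DEVICES is often set by the cluster scheduler; null means it is unset.
        System.out.println("CUDA_VISIBLE_DEVICES=" + System.getenv("CUDA_VISIBLE_DEVICES"));
        System.out.println("GPUs listed by nvidia-smi: " + countGpuLines(runNvidiaSmi()));
    }
}
```

If this reports 0 GPUs on the node where spark-submit starts the driver, that would suggest the problem is the node or job allocation rather than the jar itself.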
My pom.xml is:
<properties>
    <dl4j-master.version>1.0.0-M1.1</dl4j-master.version>
    <!-- Change the nd4j.backend property to nd4j-cuda-X-platform to use CUDA GPUs -->
    <!-- <nd4j.backend>nd4j-cuda-10.2-platform</nd4j.backend> -->
    <nd4j.backend>nd4j-cuda-11.0-platform</nd4j.backend>
    <java.version>1.8</java.version>
    <shadedClassifier>bin</shadedClassifier>
    <scala.binary.version>2.11</scala.binary.version>
    <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
    <maven.minimum.version>3.3.1</maven.minimum.version>
    <exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
    <maven-shade-plugin.version>2.4.3</maven-shade-plugin.version>
    <jcommon.version>1.0.23</jcommon.version>
    <jfreechart.version>1.0.13</jfreechart.version>
    <logback.version>1.1.7</logback.version>
    <jcommander.version>1.27</jcommander.version>
    <spark.version>2.4.8</spark.version>
    <jackson.version>2.5.1</jackson.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
    <plugins>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>${exec-maven-plugin.version}</version>
            <executions>
                <execution>
                    <goals>
                        <goal>exec</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <executable>java</executable>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>${maven-shade-plugin.version}</version>
            <configuration>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>${shadedClassifier}</shadedClassifierName>
                <createDependencyReducedPom>true</createDependencyReducedPom>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>org/datanucleus/**</exclude>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                <resource>reference.conf</resource>
                            </transformer>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Added to enable jar creation using mvn command-->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.3.0</version>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>fully.qualified.MainClass</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- bind to the packaging phase -->
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
                <source>${java.version}</source>
                <target>${java.version}</target>
            </configuration>
        </plugin>
    </plugins>
</build>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>${nd4j.backend}</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-11.0</artifactId>
        <version>1.0.0-M1.1</version>
    </dependency>
    <dependency>
        <groupId>org.datavec</groupId>
        <artifactId>datavec-spark_${scala.binary.version}</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>dl4j-spark_${scala.binary.version}</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>dl4j-spark-parameterserver_${scala.binary.version}</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>com.beust</groupId>
        <artifactId>jcommander</artifactId>
        <version>${jcommander.version}</version>
    </dependency>
    <!-- Used for patent classification example -->
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-nlp</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-zoo</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>1.0.0-M1.1</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-cuda-11.0</artifactId>
        <version>1.0.0-M1.1</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
</dependencies>
And these are the modules I have loaded on the cluster:
1) modenv/scs5 (S)
2) Maven/3.6.3
3) Java/1.8.0_161-OpenJDK
4) bzip2/1.0.8-GCCcore-8.3.0
5) ncurses/6.1-GCCcore-8.3.0
6) libreadline/8.0-GCCcore-8.3.0
7) Tcl/8.6.9-GCCcore-8.3.0
8) SQLite/3.29.0-GCCcore-8.3.0
9) XZ/5.2.4-GCCcore-8.3.0
10) GMP/6.1.2-GCCcore-8.3.0
11) libffi/3.2.1-GCCcore-8.3.0
12) Python/3.7.4-GCCcore-8.3.0
13) BigDataFrameworkConfigure/0.0.2
14) Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
15) CUDAcore/11.0.2
16) numactl/2.0.14-GCCcore-10.3.0
17) NVHPC/21.7
18) GCCcore/9.3.0
19) zlib/1.2.11-GCCcore-9.3.0
20) binutils/2.34-GCCcore-9.3.0
21) GCC/9.3.0
22) CUDA/11.0.2-GCC-9.3.0
23) nvidia-nsight/2019.3.1
Could anyone help me, please? Thank you!