
I am working with the DeepLearning4j library. I run everything on an HPC cluster, generating a jar file that I submit with spark-submit. I am using version 1.0.0-M1.1. Everything worked fine on the CPU, but when I switched to the GPU I got this error:

Warning: Versions of org.bytedeco:javacpp:1.5.4 and org.bytedeco:openblas:0.3.13-1.5.5 do not match.
Warning: Versions of org.bytedeco:javacpp:1.5.4 and org.bytedeco:opencv:4.5.1-1.5.5 do not match.
22/08/03 21:05:26 INFO BaseImageRecordReader: ImageRecordReader: 1000 label classes inferred using label generator ParentPathLabelGenerator
iterator
data list creator
java.lang.RuntimeException: No CUDA devices were found in system
        at org.nd4j.linalg.jcublas.JCublasBackend.canRun(JCublasBackend.java:69)
        at org.nd4j.linalg.jcublas.JCublasBackend.isAvailable(JCublasBackend.java:52)
        at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:160)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5092)
        at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
        at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
        at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
        at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
        at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
        at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/08/03 21:05:26 WARN Nd4jBackend: Skipped [JCublasBackend] backend (unavailable): java.lang.RuntimeException: No CUDA devices were found in system
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
        at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
        at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
        at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
        at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
        at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
        at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5095)
        at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
        ... 25 more
Caused by: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
        at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:196)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5092)
        ... 26 more

My pom.xml is:

<properties>
        <dl4j-master.version>1.0.0-M1.1</dl4j-master.version>
        <!-- Change the nd4j.backend property to nd4j-cuda-X-platform to use CUDA GPUs -->
        <!-- <nd4j.backend>nd4j-cuda-10.2-platform</nd4j.backend> -->
        <nd4j.backend>nd4j-cuda-11.0-platform</nd4j.backend>
        <java.version>1.8</java.version>
        <shadedClassifier>bin</shadedClassifier>
        <scala.binary.version>2.11</scala.binary.version>
        <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
        <maven.minimum.version>3.3.1</maven.minimum.version>
        <exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
        <maven-shade-plugin.version>2.4.3</maven-shade-plugin.version>
        <jcommon.version>1.0.23</jcommon.version>
        <jfreechart.version>1.0.13</jfreechart.version>
        <logback.version>1.1.7</logback.version>
        <jcommander.version>1.27</jcommander.version>
        <spark.version>2.4.8</spark.version>
        <jackson.version>2.5.1</jackson.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <build>
        <plugins>


            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>${exec-maven-plugin.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>${maven-shade-plugin.version}</version>
                <configuration>
                    <shadedArtifactAttached>true</shadedArtifactAttached>
                    <shadedClassifierName>${shadedClassifier}</shadedClassifierName>
                    <createDependencyReducedPom>true</createDependencyReducedPom>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>org/datanucleus/**</exclude>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!--      Added to enable jar creation using mvn command-->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>fully.qualified.MainClass</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <!-- bind to the packaging phase -->
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>


            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>


    <dependencies>


        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>${nd4j.backend}</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-11.0</artifactId>
            <version>1.0.0-M1.1</version>
        </dependency>


        <dependency>
            <groupId>org.datavec</groupId>
            <artifactId>datavec-spark_${scala.binary.version}</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>dl4j-spark_${scala.binary.version}</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>dl4j-spark-parameterserver_${scala.binary.version}</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>com.beust</groupId>
            <artifactId>jcommander</artifactId>
            <version>${jcommander.version}</version>
        </dependency>
        <!-- Used for patent classification example -->
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-nlp</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-zoo</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-core</artifactId>
            <version>1.0.0-M1.1</version>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-cuda-11.0</artifactId>
            <version>1.0.0-M1.1</version>
        </dependency>




        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>

    </dependencies>

And these are my loaded environment modules:

 1) modenv/scs5 (S)
 2) Maven/3.6.3
 3) Java/1.8.0_161-OpenJDK
 4) bzip2/1.0.8-GCCcore-8.3.0
 5) ncurses/6.1-GCCcore-8.3.0
 6) libreadline/8.0-GCCcore-8.3.0
 7) Tcl/8.6.9-GCCcore-8.3.0
 8) SQLite/3.29.0-GCCcore-8.3.0
 9) XZ/5.2.4-GCCcore-8.3.0
10) GMP/6.1.2-GCCcore-8.3.0
11) libffi/3.2.1-GCCcore-8.3.0
12) Python/3.7.4-GCCcore-8.3.0
13) BigDataFrameworkConfigure/0.0.2
14) Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
15) CUDAcore/11.0.2
16) numactl/2.0.14-GCCcore-10.3.0
17) NVHPC/21.7
18) GCCcore/9.3.0
19) zlib/1.2.11-GCCcore-9.3.0
20) binutils/2.34-GCCcore-9.3.0
21) GCC/9.3.0
22) CUDA/11.0.2-GCC-9.3.0
23) nvidia-nsight/2019.3.1



Could anyone help me, please? Thank you!

  • `java.lang.RuntimeException: No CUDA devices were found in system` that means either you have no GPUs in your system, or your CUDA install is broken (perhaps GPU driver). If you are using e.g. a university HPC resource, you may need to submit your job to a GPU partition or queue, or request GPU resources. If you are running on a managed HPC cluster, there are usually cluster admins/help desk that can quickly sort these things out for you. No one can tell you what specifically you need to do to get a GPU machine based on what you have shown here. – Robert Crovella Aug 03 '22 at 19:22
  • Yes, that's what I fixed, thank you! But now I have another error: no jnind4jcuda in java.library.path linux-x86_64/libjnind4jcuda.so: /lib64/libm.so.6: version `GLIBC_2.23' not found – nour rekik Aug 04 '22 at 14:55
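
As the first comment points out, on a managed HPC cluster you typically have to request GPU resources explicitly when submitting the job, or the process lands on a node with no visible GPU. Below is a minimal Slurm submission sketch; the partition name, time limit, and jar name are assumptions for illustration, while the CUDA module name is taken from the module list above:

    #!/bin/bash
    #SBATCH --partition=gpu            # assumption: your cluster's GPU partition/queue
    #SBATCH --gres=gpu:1               # request one GPU on the allocated node
    #SBATCH --nodes=1
    #SBATCH --time=01:00:00            # assumption: adjust to your job's needs

    # Load a CUDA toolkit matching the nd4j-cuda-11.0 backend
    module load CUDA/11.0.2-GCC-9.3.0

    # Sanity check: prints the allocated GPU, or fails if none was granted
    nvidia-smi

    spark-submit \
        --class com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain \
        my-app-1.0-bin.jar             # assumption: the shaded jar built by the shade plugin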

1 Answer


Make sure the Spark workers are running on a GPU system if you are using the CUDA backend.

Ideally, every machine that receives a CUDA-backend job for a worker should be identical; otherwise you won't see consistent performance.

Those machines should also have the same drivers and the expected CUDA versions.

I'm not sure what your system configuration is, but if you do that, you shouldn't have issues with the libraries.
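
A quick way to verify which backend actually loads on a node is to run a small check there with the shaded jar on the classpath. This is a minimal diagnostic sketch; the class name is hypothetical, not part of DL4J:

    import org.nd4j.linalg.factory.Nd4j;

    // Hypothetical diagnostic: run on a worker node to confirm the CUDA backend loads there.
    public class BackendCheck {
        public static void main(String[] args) {
            // Touching Nd4j triggers backend discovery (JCublasBackend is tried first,
            // exactly as in the log in the question).
            System.out.println("Backend: " + Nd4j.getBackend().getClass().getName());
            System.out.println("Devices visible: " + Nd4j.getAffinityManager().getNumberOfDevices());
        }
    }

If this prints the CPU backend or throws "No CUDA devices were found in system", the node either has no GPU allocated or its driver/CUDA installation does not match the nd4j-cuda-11.0 backend.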

– Adam Gibson