Solution:
Add the following environment variables to the container:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
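If you would rather bake these into the image than export them by hand, the same two settings can go at the end of the Dockerfile below (a minimal sketch; /usr/bin/python3.9 assumes the apt-get python3.9 install from that Dockerfile):
# Tell Spark which interpreter to launch, for both the driver and the workers
ENV PYSPARK_PYTHON /usr/bin/python3.9
ENV PYSPARK_DRIVER_PYTHON /usr/bin/python3.9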
I am trying to create a Spark container and spark-submit a PySpark script.
I am able to create the container, but running the PySpark script throws the following error:
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
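For context, this error is raised by Spark's PythonRunner, which launches whatever executable PYSPARK_PYTHON names and falls back to plain "python" (as the error message shows); Ubuntu's python3.9 package installs only /usr/bin/python3.9, so no "python" exists on the PATH. A quick hypothetical check inside the container confirms this:
newuser@c1f28230da16:~$ which python          # prints nothing: no bare "python" on PATH
newuser@c1f28230da16:~$ ls /usr/bin/python*   # only /usr/bin/python3.9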
Questions:
- Any idea why this error is occurring?
- Do I need to install Python separately, or does it come bundled with the Spark download?
- Do I need to install PySpark separately, or does it come bundled with the Spark download?
- For installing Python, is it preferable to download it and put it under /opt/python, or to use apt-get?
PySpark script (count.py):
from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize(
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
counts = words.count()
print "Number of elements in RDD -> %i" % (counts)
output of spark-submit:
newuser@c1f28230da16:~$ spark-submit count.py
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/02/01 19:58:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:319)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:250)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
	... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
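The trace shows the failure happens in org.apache.spark.deploy.PythonRunner before any Python code runs. Besides setting PYSPARK_PYTHON as in the solution above, an alternative sketch is to give the image a "python" executable during the build (python-is-python3 is available on Ubuntu 20.04 and later):
RUN ln -s /usr/bin/python3.9 /usr/bin/python
# or, on recent Ubuntu:
# RUN apt-get install -yq python-is-python3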
output of printenv:
newuser@c1f28230da16:~$ printenv
HOME=/home/newuser
LS_COLORS=rs=0:di=01;34:... (long terminal color map, truncated)
PYTHONPATH=:/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip
TERM=xterm
SHLVL=1
SPARK_HOME=/opt/spark
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/java/bin:/opt/spark/bin
_=/usr/bin/printenv
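One more detail worth noting in this output: PYTHONPATH points at py4j-0.10.4-src.zip, but spark-3.0.1-bin-hadoop3.2 ships py4j-0.10.9, so that entry is a dead path and "import pyspark" would still fail once the interpreter is found. A corrected line for the Dockerfile (assuming the stock Spark 3.0.1 layout):
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip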
myspark Dockerfile:
ARG JAVA_HOME=/opt/java
ARG JDK_PACKAGE=openjdk-14.0.2_linux-x64_bin.tar.gz
ARG SPARK_HOME=/opt/spark
ARG SPARK_PACKAGE=spark-3.0.1-bin-hadoop3.2.tgz

#MAINTAINER demo@gmail.com
#LABEL maintainer="demo@foo.com"

############################################
### Install openjava
############################################

# Base image stage 1
FROM ubuntu as jdk

ARG JAVA_HOME
ARG JDK_PACKAGE

WORKDIR /opt/

## download open java
# ADD https://download.java.net/java/GA/jdk14.0.2/205943a0976c4ed48cb16f1043c5c647/12/GPL/$JDK_PACKAGE /
# ADD $JDK_PACKAGE /
COPY $JDK_PACKAGE .

RUN mkdir -p $JAVA_HOME/ && \
    tar -zxf $JDK_PACKAGE --strip-components 1 -C $JAVA_HOME && \
    rm -f $JDK_PACKAGE

############################################
### Install spark search
############################################

# Base image stage 2
FROM ubuntu as spark

#ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

## download spark
COPY $SPARK_PACKAGE .

RUN mkdir -p $SPARK_HOME/ && \
    tar -zxf $SPARK_PACKAGE --strip-components 1 -C $SPARK_HOME && \
    rm -f $SPARK_PACKAGE

# Mount elasticsearch.yml config
### ADD config/elasticsearch.yml /opt/elasticsearch/config/elasticsearch.yml

############################################
### final
############################################

FROM ubuntu as finalbuild

ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

# get artifacts from previous stages
COPY --from=jdk $JAVA_HOME $JAVA_HOME
COPY --from=spark $SPARK_HOME $SPARK_HOME

# Setup JAVA_HOME, this is useful for docker commandline
ENV JAVA_HOME $JAVA_HOME
ENV SPARK_HOME $SPARK_HOME

# setup paths
ENV PATH $PATH:$JAVA_HOME/bin
ENV PATH $PATH:$SPARK_HOME/bin
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip

# Expose ports
# EXPOSE 9200
# EXPOSE 9300

# Define mountable directories.
#VOLUME ["/data"]

## give permission to entire setup directory
RUN useradd newuser --create-home --shell /bin/bash && \
    echo 'newuser:newpassword' | chpasswd && \
    chown -R newuser $SPARK_HOME $JAVA_HOME && \
    chown -R newuser:newuser /home/newuser && \
    chmod 755 /home/newuser
#chown -R newuser:newuser /home/newuser
#chown -R newuser /home/newuser && \

# Install Python
RUN apt-get update && \
    apt-get install -yq curl && \
    apt-get install -yq vim && \
    apt-get install -yq python3.9

## Install PySpark and Numpy
#RUN \
#    pip install --upgrade pip && \
#    pip install numpy && \
#    pip install pyspark

USER newuser
WORKDIR /home/newuser
# RUN chown -R newuser /home/newuser
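For reference, a build-and-test cycle against this Dockerfile might look like the following (myspark is a hypothetical image tag, and the JDK and Spark tarballs are expected in the build context):
docker build -t myspark .
docker run -it --rm myspark bash
newuser@<container-id>:~$ spark-submit count.py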