I'm running a script that uses the Python Client Library for Google Cloud Dataproc to automatically provision clusters, submit jobs, and so on. When I try to submit a job, it fails with "ImportError: No module named pandas". I import pandas, as well as several other packages, in the script that the job runs. I'm not sure how to get around this issue.
So does this make sense?
#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # -y on every call: the script runs unattended, so apt-get must not prompt.
  apt-get install -y python-pandas
  apt-get install -y python-numpy
  apt-get install -y g++ cmake
  # math, os, sys, glob and gzip are Python standard-library modules;
  # they ship with the interpreter and have no apt package to install.
  apt-get install -y python-argparse
  apt-get install -y python-hail
fi
Here is my updated bash script:
#!/bin/bash
# Space-separated (no commas, no space after "=") so the loop sees one package per word.
list="python-pandas python-numpy python-argparse"
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  for i in $list; do
    sudo apt-get install -y "$i"
  done
  wget -P /home/anaconda2/ https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
  bash /home/anaconda2/Anaconda2-4.3.1-Linux-x86_64.sh -b -f -p /home/anaconda2/
  # chmod takes the mode first, then the path.
  chmod 0777 /home/anaconda2
  /home/anaconda2/bin/pip install lxml
  /home/anaconda2/bin/pip install jupyter-spark
  /home/anaconda2/bin/pip install jgscm
fi
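
One way to tell whether either script actually removes the ImportError is to test the imports directly on a node, once with the system interpreter and once with the Anaconda one. The sketch below is only an illustration: the helper name check_modules is hypothetical, and which interpreter the submitted job really uses is an assumption that would need to be confirmed on the cluster.

```shell
#!/bin/bash
# check_modules INTERPRETER MODULE...
# Hypothetical helper: tries "import MODULE" with the given Python interpreter,
# prints one line per module, and returns non-zero if any import fails.
check_modules() {
  local py="$1"
  shift
  local missing=0
  local mod
  for mod in "$@"; do
    if "$py" -c "import ${mod}" >/dev/null 2>&1; then
      echo "ok: ${mod}"
    else
      echo "MISSING: ${mod}"
      missing=1
    fi
  done
  return "$missing"
}

# Example invocations (interpreter paths are assumptions based on the scripts above):
#   check_modules /usr/bin/python pandas numpy
#   check_modules /home/anaconda2/bin/python pandas numpy lxml
```

If pandas is reported as present for one interpreter but missing for the other, the job is probably running under the interpreter that lacks it. Running the same check on a worker node would also show whether installing only when ROLE is 'Master' leaves the workers without the packages.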