
I'm running a script using the Python Client Library for Google Cloud Dataproc that automatically provisions clusters, submits jobs, etc. But when I submit a job, it fails with `ImportError: no module named pandas`. My script imports pandas, as well as several other packages, and the job runs from that script. I'm not sure how to get around this issue.

So does this make sense?

    #!/bin/bash
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then 
        apt-get install python-pandas -y
        apt-get install python-numpy -y
        apt-get install g++ cmake
        apt-get install python-math
        apt-get install python-argparse
        apt-get install python-os
        apt-get install python-sys
        apt-get install python-glob
        apt-get install python-gzip
        apt-get install python-hail
    fi

Here is my updated bash script:

    #!/bin/bash
    list="python-pandas python-numpy python-argparse"

    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)

    if [[ "${ROLE}" == 'Master' ]]; then 
        for i in $list; do
          sudo apt-get install -y $i
        done

        wget -P /home/anaconda2/ https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
        bash /home/anaconda2/Anaconda2-4.3.1-Linux-x86_64.sh -b -f -p /home/anaconda2/
        chmod 0777 /home/anaconda2
        /home/anaconda2/bin/pip install lxml
        /home/anaconda2/bin/pip install jupyter-spark
        /home/anaconda2/bin/pip install jgscm

    fi
claudiadast

1 Answer


Pandas isn't installed by default on Dataproc. You can install custom Python libraries via an initialization action, similar to this one.
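Since the question mentions provisioning clusters with the Python client library, the initialization action gets attached in the cluster config you pass to `create_cluster`. Here's a rough sketch; the helper name and GCS paths are mine, but `initialization_actions` and `executable_file` are the fields the Dataproc API expects (the script must be uploaded to a GCS bucket first):

```python
def make_cluster(cluster_name, bucket, init_script):
    # Builds the cluster dict passed to the Dataproc Python client's
    # create_cluster call. The script at executable_file runs on each
    # node when the cluster starts up.
    return {
        "cluster_name": cluster_name,
        "config": {
            "initialization_actions": [
                {"executable_file": "gs://{}/{}".format(bucket, init_script)}
            ]
        },
    }
```

With that config, the install script runs on every node before the cluster accepts jobs, so pandas is already on the Python path by the time your job imports it.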

For reference, I've run the following just to verify that pandas is found on at least one node:

    #!/usr/bin/python
    import pyspark
    import pandas
    sc = pyspark.SparkContext()
    vals = sc.parallelize(xrange(1, 100))
    reprs = vals.mapPartitions(lambda es: [repr(pandas) for e in es])
    for r in reprs.collect():
      print r

My initialization action is then simply:

    #!/bin/bash
    apt-get install python-pandas python-numpy -y
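If you want the same "is it importable" check without spinning up a SparkContext, a plain-Python version looks like this (the helper name is mine, not part of any library):

```python
import importlib

def missing_modules(names):
    """Return the subset of module names that fail to import on this node."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling `missing_modules(["pandas", "numpy"])` inside a `mapPartitions` lambda tells you exactly which installs didn't take on which workers.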
Angus Davis
  • I added what my version of the initialization action script should be above. Does that all make sense? – claudiadast Sep 22 '17 at 21:12
  • Pandas and numpy certainly make sense, I wonder a bit if all of the others are necessary at run time (g++ and make seem fishy if you're using prebuilt packages). You may also want to simply install these packages on all nodes and not just the master (depending on what you're doing). One last note, a single invocation of apt-get install will save you some startup time. – Angus Davis Sep 22 '17 at 21:57
  • 1
    When I simply try the initialization script that you provided the link to above (the one that installs pandas), it still gives me an error of `no module named pandas` when I submit a pyspark job. – claudiadast Sep 25 '17 at 22:48
  • I've updated to include my test script and init action. My test is admittedly not "Does pandas work", but is instead "Is pandas installed on the python path". – Angus Davis Sep 27 '17 at 21:28