I created a Spark cluster with gcloud using the following options:

gcloud dataproc clusters create cluster-name --region us-east1 --num-masters 1 --num-workers 2 --master-machine-type n1-standard-2 --worker-machine-type n1-standard-1 --metadata spark-packages=graphframes:graphframes:0.2.0-spark2.1-s_2.11

On the Spark master node, I launched the pyspark shell as follows:

pyspark --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11

...

found graphframes#graphframes;0.2.0-spark2.0-s_2.11 in spark-packages

[SUCCESSFUL ] graphframes#graphframes;0.2.0-spark2.0-s_2.11!graphframes.jar (578ms)

...

    graphframes#graphframes;0.2.0-spark2.0-s_2.11 from spark-packages in [default]
    org.scala-lang#scala-reflect;2.11.0 from central in [default]
    org.slf4j#slf4j-api;1.7.7 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   5   |   5   |   5   |   0   ||   5   |   5   |
    ---------------------------------------------------------------------

...

Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.

>>> from graphframes import *

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes

How do I load graphframes on a gcloud Dataproc Spark cluster?

Progmatix
  • `--packages` specifies Java/Scala packages, right? Is there a python package you need to download as well? If you have to `pip install graphframes`, please ensure it doesn't depend on the `pyspark` or `py4j` packages. Installing either one of those through `pip` will break `pyspark` on your cluster :( Instead, just install `graphframes` without those dependencies. – Karthik Palaniappan May 16 '18 at 16:25

1 Answer

This seems to be a known issue: you have to jump through hoops to get graphframes working in pyspark. See https://github.com/graphframes/graphframes/issues/238 and https://github.com/graphframes/graphframes/issues/172.
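One commonly mentioned workaround, sketched here under assumptions rather than taken from the linked issues verbatim: a jar is a zip archive, so if the graphframes jar bundles its Python package at the archive root, putting the jar on sys.path lets Python import it. The helper name add_jars_to_python_path is hypothetical, and ~/.ivy2/jars is only the assumed default download location for --packages; check where the jar actually landed on your master node.

```python
import glob
import os
import sys


def add_jars_to_python_path(jars_dir, pattern="*graphframes*.jar"):
    """Prepend jars in `jars_dir` that match `pattern` to sys.path.

    Hypothetical helper: it relies on Python's zipimport, which can load
    a Python package stored at the root of a zip archive (a jar is one).
    Returns the sorted list of matching jar paths.
    """
    jars = sorted(glob.glob(os.path.join(jars_dir, pattern)))
    for jar in jars:
        if jar not in sys.path:
            sys.path.insert(0, jar)
    return jars


# Assumption: --packages downloaded the jar into the default Ivy cache.
found = add_jars_to_python_path(os.path.expanduser("~/.ivy2/jars"))
print("jars added to sys.path:", found)
```

If a jar is found, retrying `from graphframes import *` in the same pyspark shell may then succeed. This only affects the driver's Python process; executors may still need the jar shipped to them, for example with sc.addPyFile.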

Karthik Palaniappan