
I want to use GraphFrames with PySpark (currently using Spark v2.3.3, on Google Dataproc).

After installing GraphFrames with

pip install graphframes

I try to run the following code:

from graphframes import *

localVertices = [(1, "A"), (2, "B"), (3, "C")]
localEdges = [(1, 2, "love"), (2, 1, "hate"), (2, 3, "follow")]

v = sqlContext.createDataFrame(localVertices, ["id", "name"])
e = sqlContext.createDataFrame(localEdges, ["src", "dst", "action"])

g = GraphFrame(v, e)

but I get this error:

Py4JJavaError: An error occurred while calling o301.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Any ideas how to fix this issue?

Alex

2 Answers


To use GraphFrames with Spark, you should install it as a Spark package, not a pip package:

pyspark --packages graphframes:graphframes:0.7.0-spark2.3-s_2.11
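
If you are submitting a standalone script rather than using the interactive shell, the same package coordinates can be passed through PYSPARK_SUBMIT_ARGS before the session is created. A minimal sketch, assuming Spark 2.3 with Scala 2.11 (adjust the version string to match your cluster); the app name is arbitrary, and the graphframes import has to come after the session exists so the resolved jar is already on the Python path:

import os

# Must be set before the SparkSession (and its JVM) is started.
# The trailing "pyspark-shell" is required when Spark is launched from plain Python.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.7.0-spark2.3-s_2.11 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graphframes-check").getOrCreate()

# Import only after the session is created, once the package jar has been resolved.
from graphframes import GraphFrame
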
Igor Dvorzhak
  • I have passed that as an env variable in my Dockerfile `ENV PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12"` but I do still get the error: `File "/usr/local/spark/jobs/cetus/graph_post_processor_job.py", line 7, in from graphframes import * ModuleNotFoundError: No module named 'graphframes'`. – Malgi Jul 01 '21 at 00:47

If you are using Jupyter for development, start it from pyspark rather than directly or from Anaconda. That is, open a terminal and run

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

This starts Jupyter with the correct Spark packages loaded in the background. If you then import it in your script with from graphframes import *, it will be picked up correctly and run.
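
For example, the code from the question should then run in a notebook cell without the ClassNotFoundException. A quick check (the spark session is already created by the pyspark launcher; inDegrees is only used here to confirm the Scala backend is found):

from graphframes import GraphFrame

v = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "name"])
e = spark.createDataFrame([(1, 2, "love"), (2, 1, "hate"), (2, 3, "follow")],
                          ["src", "dst", "action"])

g = GraphFrame(v, e)
g.inDegrees.show()
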

Alex Ortner