I'm trying to use GraphFrames with PySpark from a Jupyter notebook. My Spark cluster is on HDInsight, so I don't have access to edit kernel.json.
The solutions suggested [here][1] and [here][2] didn't work. This is what I tried to run:
import os

# Maven coordinates for the GraphFrames package (Scala 2.11 build)
packages = "graphframes:graphframes:0.3.0-spark2.0-s_2.11"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
from graphframes import *
This failed with an ImportError saying no module named graphframes exists. Is there a way to start a new SparkContext after changing this environment variable?
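Concretely, I was hoping for something along these lines. This is only a sketch: sc here is the SparkContext the Jupyter kernel already created, and I don't know whether stopping and recreating it actually picks up the new submit args.

import os
from pyspark import SparkContext

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 pyspark-shell"
)

sc.stop()            # stop the context the kernel set up
sc = SparkContext()  # does this relaunch with the new submit args?
from graphframes import *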
I've also tried passing the PYSPARK_SUBMIT_ARGS variable to IPython via the %set_env magic (no quotes around the value, since %set_env takes the rest of the line verbatim) and then importing graphframes:

%set_env PYSPARK_SUBMIT_ARGS=--packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 pyspark-shell
from graphframes import *
But this resulted in the same error.
I've also seen suggestions to pass the jar directly to IPython, but I'm not sure how to download the needed jar to my HDInsight cluster.
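The only idea I had was to fetch the jar by hand from the spark-packages Maven repository. The sketch below assumes a repository URL of that shape (unverified) and a Python 3 kernel for urllib.request:

import urllib.request

version = "0.3.0-spark2.0-s_2.11"
jar_url = (
    "https://repos.spark-packages.org/graphframes/graphframes/"
    "{0}/graphframes-{0}.jar".format(version)
)
# Save the jar locally so it can be passed to Spark via --jars / spark.jars
urllib.request.urlretrieve(jar_url, "/tmp/graphframes-{0}.jar".format(version))

But even if that download works, I'm not sure --jars alone is enough, since the graphframes Python package itself also has to end up on the Python path.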
Do you have any suggestions?