Using Spark packages with Jupyter Notebook on HD Insight

Question

I'm trying to use graphFrames on PySpark via a Jupyter notebook. My Spark cluster is on HD Insight, so I don't have access to edit kernel.json.

The solutions suggested [here][1] and [here][2] didn't work. This is what I tried to run:

import os
packages = "graphframes:graphframes:0.3.0-spark2.0" # -s_2.11
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
from graphframes import *

This resulted in an error that a module named graphframes doesn't exist. Is there a way to initiate a new SparkContext after changing this env variable?

I've also tried passing the PYSPARK_SUBMIT_ARGS variable to IPython via the %set_env magic command and then importing graphframes:

%set_env PYSPARK_SUBMIT_ARGS='--packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 pyspark-shell'

from graphframes import *

But this resulted in the same error.

I saw some suggestions to pass the jar to IPython, but I'm not sure how to download the needed jar to my HD Insight cluster.

Do you have any suggestions?

score 1 · Accepted Answer · edited May 23 '17 at 11:46

1

It turns out I had two separate issues:

1) I was using the wrong syntax to configure the notebook. you should use:

# For HDInsight 3.3 and HDInsight 3.4
%%configure 
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }

# For HDInsight 3.5
%%configure 
{ "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" }}

Here are the relevant docs from Microsoft.

2) According to this useful answer, there seems to be bug in Spark that causes it to miss the package's jar. This worked for me:

sc.addPyFile(os.path.expanduser('./graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar'))

edited May 23 '17 at 11:46

Community

1
1

answered Mar 01 '17 at 07:34

mbrg

498
10
24

Hi, I'm trying to run spark on my local system from Jupyter Notebook and I want to add few packages for Cassandra Connecter. Will your solution work for this as well? – Abdul Haseeb Dec 25 '20 at 07:59

Using Spark packages with Jupyter Notebook on HD Insight

1 Answers1