
I have a Docker container with JupyterHub installed, running on an AWS EMR cluster, as described here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html. It has Python 3, PySpark 3, PySpark, SparkR, and Spark kernels, and inside the container conda and many other Python packages are installed, but no Spark. The problem is that when I run the pyspark or pyspark3 kernel, it connects to the Spark installed on the main node (outside the Docker container), and none of the container's packages are available to that notebook any more (they are visible to the Python kernel, but then Spark is not visible in that case).

So the question is: how do I make the modules installed inside the Docker container available and visible to a pyspark/pyspark3 notebook? I think there is something in the settings I'm missing.

I'm essentially looking for a way to use the modules installed inside the container WITH the externally installed Spark in one notebook.

So far I can get only one or the other.
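A quick sanity check that illustrates the split (a sketch, not from the original post): run the same cell in the Python 3 kernel and in a pyspark kernel and compare the interpreter paths.

```python
# Hypothetical diagnostic: see which Python each kernel actually runs.
import sys

print(sys.executable)  # Python 3 kernel: the interpreter inside the Docker container
print(sys.path[:3])    # pyspark/pyspark3 kernels (via Livy): paths on the EMR main node / cluster
```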


1 Answer


I just found half of the answer here https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d and here https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-notebook-kernels. The secret is to use the %%local magic in a cell, which gives access to the Python modules installed locally (in the Docker container). Now I just don't know how to persist a pandas DataFrame created in the "pyspark part" of the notebook so that it is available in the "local" part.
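For illustration, a minimal sketch of how the two kinds of cells differ (the DataFrame name is made up; note that the %%local magic has to be the first line of its notebook cell):

```python
# --- Cell 1: default pyspark kernel, executed remotely on the cluster via Livy ---
df = spark.range(10).toPandas()   # 'df' lives on the remote Spark side, not in the container

# --- Cell 2: %%local runs the cell body in the local Python inside the container ---
%%local
import pandas as pd               # resolved from the container's conda environment
print(pd.__version__)             # 'df' from Cell 1 is NOT visible here
```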
