I am using JupyterHub on an EMR cluster, and pandas is not installed in the PySpark or PySpark3 kernels. These kernels also do not allow shell commands prefixed with !. I have tried to install it using

import pip
pip.main(['install', 'pandas'])

But this raises ValueError: I/O operation on closed file.

When I open the terminal kernel, pandas is already installed.

Please let me know if there are other ways to install a package into a specific kernel.

1 Answer

I faced a similar problem, and this resolved it for me:

# bootstrap action (runs on each node at cluster start)
sudo python3 -m pip install <packages>
# set in $SPARK_HOME/conf/spark-env.sh or use the config.json template for EMR
export PYSPARK_DRIVER_PYTHON=python3
export PYSPARK_PYTHON=python3
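
To check whether the change took effect, a small diagnostic run in a notebook cell can help. This is only a sketch: describe_module is a hypothetical helper name, and it assumes the kernel executes the cell in the Spark driver's Python process, so the interpreter it reports is the one PYSPARK_DRIVER_PYTHON selected.

```python
import importlib.util
import sys


def describe_module(name):
    """Return (interpreter_path, module_location_or_None) for this interpreter.

    describe_module is a hypothetical helper, not part of any library.
    """
    # find_spec returns None when a top-level module cannot be found.
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else None
    return sys.executable, location


interpreter, location = describe_module("pandas")
print("interpreter:", interpreter)
print("pandas:", location or "not importable from this interpreter")
```

If the interpreter shown is not the python3 that pip installed into, the kernel and the terminal are probably using different Python environments, which would explain why pandas appears in one but not the other.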

Reference: AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

thePurplePython