I am spinning up an EMR cluster in AWS. The difficulty arises when using Jupyter to import the associated Python modules. I have a shell script that runs when the EMR cluster starts and installs those Python modules.
The notebook is set to run using the PySpark kernel.
I believe the problem is that the Jupyter notebook is not pointed at the correct Python on the EMR cluster. The methods I have used to point the notebook to the correct version do not seem to work.
I have set the following configuration. I have tried changing "python" to "python3.6" and "python3".
Configurations=[{
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [{
        "Classification": "export",
        "Properties": {
            "PYSPARK_PYTHON": "python",
            "PYSPARK_DRIVER_PYTHON": "python",
            "SPARK_YARN_USER_ENV": "python"
        }
    }]
}]
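For reference, this is roughly how that classification plugs into cluster creation; a minimal sketch assuming boto3's run_job_flow, shown with the python3.6 variant I tried (the cluster name, region, release label, instance settings, and IAM roles below are placeholders, not my real values):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

emr.run_job_flow(
    Name="my-cluster",                       # placeholder name
    ReleaseLabel="emr-5.26.0",               # placeholder release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [{
            "Name": "Master",
            "InstanceRole": "MASTER",
            "InstanceType": "m5.xlarge",     # placeholder instance type
            "InstanceCount": 1,
        }],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",       # placeholder roles
    ServiceRole="EMR_DefaultRole",
    Configurations=[{
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "PYSPARK_PYTHON": "python3.6",
                "PYSPARK_DRIVER_PYTHON": "python3.6",
            },
        }],
    }],
)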
I am certain that my shell script is installing the modules, because when I run the following on the EMR command line (via SSH) it works:
python3.6
import boto3
However, when I run the following, it does not work:
python
import boto3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named boto3
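To compare what the two interpreters actually see, I can run a small check under each of them (e.g. save it as check.py and run both "python check.py" and "python3.6 check.py"; the filename is just an example):

import sys

print(sys.executable)   # full path of the interpreter that is running
print(sys.path)         # directories this interpreter searches for modules such as boto3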
When I run the following in Jupyter, I get the output below:
import sys
import os
print(sys.version)
2.7.16 (default, Jul 19 2019, 22:59:28) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
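Inside the notebook I can also check which interpreter the PySpark driver picked up and whether the spark-env exports reached it; a short sketch:

import os
import sys

print(sys.executable)                            # interpreter backing the notebook/driver
print(os.environ.get("PYSPARK_PYTHON"))          # should reflect the spark-env export, if applied
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))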
This is the shell script that runs when the EMR cluster starts:
#!/bin/bash
alias python=python3.6
export PYSPARK_DRIVER_PYTHON="python"
export SPARK_YARN_USER_ENV="python"
sudo python3 -m pip install boto3
sudo python3 -m pip install pandas
sudo python3 -m pip install pymysql
sudo python3 -m pip install xlrd
sudo python3 -m pip install pymssql
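To confirm which interpreters those pip installs actually landed in, a small portable check I can run under both python and python3.6 is:

try:
    import boto3
    print(boto3.__file__)                        # shows where this interpreter found boto3
except ImportError as exc:
    print("boto3 not importable here:", exc)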
When I attempt to import boto3 in Jupyter, I get the following error message:
Traceback (most recent call last):
ImportError: No module named boto3