I am using an EMR notebook attached to my cluster for experimentation. I needed to install some Python modules for testing, specifically spacy and its data module en_core_web_sm.

I SSH'ed into the master and core nodes and installed the modules on each one individually. However, I am not able to import them from my EMR notebook. I get the following error:

An error was encountered:
No module named 'spacy'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'spacy'

I know there is a way to install them just for the scope of the EMR notebook, but that wouldn't suffice in a production scenario, so please avoid answers that suggest notebook-scoped installation as described in this guide: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/

Please let me know if I am missing some setup steps. Appreciate your response.

Rahul Patwa

2 Answers

You can use bootstrap actions to install additional modules when creating your EMR cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html

srikanth holur
  • Yes, but that doesn't explain why I am not able to access modules which I've manually installed on the cluster. (The EMR bootstrap script would run exactly the same commands I think) – Rahul Patwa Jun 19 '20 at 20:14
  • Are you sure that the old instances were not terminated and new ones were not added to your EMR? – srikanth holur Jun 19 '20 at 20:17
  • No, I didn't terminate any instances. This is a running cluster and all instances are unchanged. – Rahul Patwa Jun 19 '20 at 20:26
  • Can you try importing your module in python shell on each node to make sure it's installed properly? – srikanth holur Jun 19 '20 at 20:47
  • Yeah, I just tried opening up the python3 shell and the import command works on master and cores – Rahul Patwa Jun 19 '20 at 21:01
  • I am not sure how to do that on an EMR notebook. I tried submitting the script directly on the master node of the EMR cluster using this command: [hadoop@ip-10-0-1-162 ~]$ spark-submit E2ECompleteTest.py --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 --master yarn --deploy-mode cluster I followed this post: https://stackoverflow.com/questions/29972565/how-to-specify-the-version-of-python-for-spark-submit-to-use and this documentation: https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties Am I doing it wrong? – Rahul Patwa Jun 19 '20 at 22:13
  • https://aws.amazon.com/premiumsupport/knowledge-center/emr-pyspark-python-3x/ . Look at this answer as well: https://stackoverflow.com/questions/57512577/how-to-set-jupyter-notebook-to-python3-instead-of-python2-7-in-aws-emr – srikanth holur Jun 19 '20 at 23:01
  • Hey Srikanth, I was able to solve this problem as mentioned in my answer below. I didn't get a chance to try your solution but I'll give it a go. Thanks for your patience and persistence! – Rahul Patwa Jun 22 '20 at 17:12

I was able to solve this by changing the bootstrap script to install with sudo instead of --user. (You could also run the commands below manually on each node.)

Before, I was running

pip3 install spacy --user
python3 -m spacy download en --user

I changed that script to

sudo pip3 install spacy
sudo python3 -m spacy download en
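
One plausible explanation, which is an assumption and not something confirmed in this thread: pip3 install --user puts packages into the per-user site-packages directory of whoever ran it (the hadoop user over SSH), while the notebook session may run as a different user or with a different Python interpreter, so it never looks there. The sketch below is meant to be run in a notebook cell and in a python3 shell on the node, so the two outputs can be compared. describe_environment is a hypothetical helper name, not part of EMR or Spark.

```python
import getpass
import importlib.util
import site
import sys


def describe_environment(module_name="spacy"):
    """Collect facts about the interpreter that would import module_name."""
    try:
        user = getpass.getuser()
    except Exception:
        # Some containers have no passwd entry for the current uid.
        user = "unknown"
    # find_spec returns None when a top-level module cannot be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "user": user,
        "executable": sys.executable,
        "user_site": site.getusersitepackages(),
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        "module_location": spec.origin if spec is not None else None,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the user or the user_site path differs between the SSH shell and the notebook, that would be consistent with the --user install being invisible to the notebook.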

To verify this quickly, run the following command from your EMR notebook before and after the change and compare the output:

sc.list_packages()

You should see an output similar to this

SparkSession available as 'spark'.
Package                    Version   
-------------------------- ----------
beautifulsoup4             4.9.0     
blis                       0.4.1     
boto                       2.49.0    
catalogue                  1.0.0     
certifi                    2020.4.5.2
chardet                    3.0.4     
cymem                      2.0.3     
en-core-web-sm             2.3.0     
idna                       2.9       
importlib-metadata         1.6.1     
jmespath                   0.9.5     
lxml                       4.5.0     
murmurhash                 1.0.2     
mysqlclient                1.4.2     
nltk                       3.4.5     
nose                       1.3.4     
numpy                      1.16.5    
pip                        9.0.1     
plac                       1.1.3     
preshed                    3.0.2     
py-dateutil                2.2       
python37-sagemaker-pyspark 1.3.0     
pytz                       2019.3    
PyYAML                     5.3.1     
requests                   2.24.0    
setuptools                 28.8.0    
six                        1.13.0    
soupsieve                  1.9.5     
spacy                      2.3.0     
srsly                      1.0.2     
thinc                      7.4.1     
tqdm                       4.46.1    
urllib3                    1.25.9    
wasabi                     0.6.0     
wheel                      0.29.0    
windmill                   1.6       
zipp                       3.1.0
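
The listing above is what the notebook session itself reports. To check the worker side as well, a rough sketch is to run a small probe inside Spark tasks and collect the results. probe and probe_cluster are hypothetical helper names, sc is assumed to be the SparkContext the notebook already provides, and only hosts that happened to run a task will show up in the result.

```python
import importlib.util
import socket


def probe(module_name):
    """Return (hostname, module_name, importable) for the current process."""
    spec = importlib.util.find_spec(module_name)
    return (socket.gethostname(), module_name, spec is not None)


def probe_cluster(sc, module_names, partitions=8):
    """Run probe() inside Spark tasks and return the distinct results.

    sc is an existing SparkContext; partitions only controls how many
    tasks are launched, it does not guarantee that every node is reached.
    """
    results = (
        sc.parallelize(range(partitions), partitions)
        .flatMap(lambda _: [probe(name) for name in module_names])
        .collect()
    )
    return sorted(set(results))


# Local check on the current interpreter (no cluster needed):
print(probe("spacy"))
```

In the notebook this could be called as probe_cluster(sc, ["spacy", "en_core_web_sm"]); any tuple ending in False points at a host where the module is not importable by the Python that Spark uses.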

This is not the best possible solution IMO, since the first warning that gets displayed after using sudo is

WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.

If anyone has a better solution, please feel free to post it.

Rahul Patwa