I am trying to submit a job to an EMR cluster via Livy. My Python script (for the submitted job) imports a few packages, and I have installed all of them on the master node of EMR. The main script resides on S3 and is referenced by the script that submits the job to Livy from EC2. Every time I try to run the job from the remote machine (EC2), it dies with an import error (no module named [mod name]).

I have been stuck on this for more than a week and have been unable to find a solution. Any help would be highly appreciated. Thanks.

Shweta
  • The libs are required by the executors as well, so all of them need to be present on all the core nodes. One clean way of doing that is to use a bootstrap script. There are lots of resources online on how to do that. Also, you can check my answer here: https://stackoverflow.com/a/57408712/4245859 – Bitswazsky Apr 17 '20 at 11:42

1 Answer

Are the packages you are trying to import custom packages? If so, how did you package them? Did you create a wheel file or a zip file and pass it as --py-files in your spark-submit via Livy?

Possible problem.

You installed the packages only on the master node. You will need to log into your worker nodes and install the packages there too. Alternatively, install the packages with bootstrap actions when you provision the EMR cluster.
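To confirm that this is the cause, one option is to ask the executors themselves which modules they cannot find. The following is a minimal sketch, not a definitive diagnostic: the helper names `missing_modules` and `check_partition` are made up for illustration, and the usage comment assumes a PySpark job with an existing SparkContext called `sc`.

```python
import importlib.util
import socket


def missing_modules(names):
    """Return the names that the current Python interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised e.g. when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def check_partition(_rows, names=("sqlalchemy", "psycopg2")):
    """Meant to run on an executor: yield (hostname, missing modules)."""
    # A tuple is used so the results are hashable and can be de-duplicated.
    yield socket.gethostname(), tuple(missing_modules(names))


# Usage inside a PySpark job (assumes an existing SparkContext `sc`):
#   report = (sc.parallelize(range(100), 20)
#               .mapPartitions(check_partition)
#               .distinct()
#               .collect())
#   print(report)
```

Any host that reports a non-empty tuple is a node where the package still has to be installed.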

You should be able to add libraries via the --py-files option, but it's safer to download the wheel files and use them rather than zipping anything yourself.
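When the job goes through Livy's REST API rather than spark-submit, the equivalent of --py-files is the `pyFiles` field of the `POST /batches` request body. The sketch below shows one way to build and send that request using only the standard library. The S3 URIs and the Livy host are hypothetical placeholders, and the function names are made up for illustration. Note that packages with compiled extensions (psycopg2 is one) generally cannot be imported from a zip archive, so those likely still need to be installed on every node.

```python
import json
import urllib.request


def build_batch_payload(script_uri, py_files=(), args=()):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": script_uri}
    if py_files:
        # Livy's counterpart of spark-submit's --py-files option.
        payload["pyFiles"] = list(py_files)
    if args:
        payload["args"] = [str(a) for a in args]
    return payload


def submit_batch(livy_url, payload):
    """POST the payload to Livy and return the parsed JSON response."""
    req = urllib.request.Request(
        livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical locations; replace with your own bucket and master node.
payload = build_batch_payload(
    "s3://my-bucket/jobs/main.py",
    py_files=["s3://my-bucket/deps/my_custom_pkg.zip"],
)
# submit_batch("http://<emr-master-dns>:8998", payload)  # 8998 is Livy's default port
```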

Emerson
  • No, they are standard libraries like sqlalchemy and psycopg2. Also, I zipped them and passed them as the pyFiles argument in the Livy submit. – Shweta Apr 01 '20 at 19:37
  • If this is EMR, you are better off installing them via bootstrap actions, so that it is a one-time activity, since the libraries will not change. Passing them as --py-files is for when you have a custom library. Either way, if you want to go this route, try downloading the whl files for these libraries and using those. Remember that installing packages on the master node does not make them available on the worker nodes. – Emerson Apr 01 '20 at 19:42