I need to automate PySpark scripts to execute on an existing AWS EMR cluster for a client. The constraints are:

  1. No SSH access to the cluster's head node
  2. Can't create any EC2 instances
  3. Others in my group add their code via the Steps tab of the running cluster
  4. I have read/write access to S3
  5. The cluster remains in a running state; no need to script its stand-up or tear-down
  6. I have PyCharm Professional

I reviewed this SO post, which is close to what I am after. Ideally, I would use Python with boto3 from PyCharm to submit the PySpark code to their long-running cluster. What would others do here?
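
For concreteness, here is a minimal sketch of what I have in mind. It uses boto3's add_job_flow_steps, which is the API equivalent of adding a step through the Steps tab; the bucket name, script path, region, and cluster ID below are placeholders, and I have not run this against their cluster:

    import boto3

    # Placeholders -- the bucket, key, region, and cluster ID are made-up
    # values for illustration.
    BUCKET = "client-bucket"
    SCRIPT_KEY = "jobs/my_pyspark_job.py"
    CLUSTER_ID = "j-XXXXXXXXXXXXX"
    REGION = "us-east-1"

    # Upload the PySpark script to S3 (constraint 4: I have read/write access).
    s3 = boto3.client("s3", region_name=REGION)
    s3.upload_file("my_pyspark_job.py", BUCKET, SCRIPT_KEY)

    # Add a step to the long-running cluster; command-runner.jar invokes
    # spark-submit on the cluster itself, so no SSH access is needed.
    emr = boto3.client("emr", region_name=REGION)
    response = emr.add_job_flow_steps(
        JobFlowId=CLUSTER_ID,
        Steps=[
            {
                "Name": "my-pyspark-step",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://{}/{}".format(BUCKET, SCRIPT_KEY),
                    ],
                },
            }
        ],
    )
    step_id = response["StepIds"][0]

    # Optionally block until the step finishes.
    waiter = emr.get_waiter("step_complete")
    waiter.wait(ClusterId=CLUSTER_ID, StepId=step_id)

Credentials would come from the usual boto3 chain (environment variables or a shared credentials file on my desktop), so nothing in this flow needs SSH or a new EC2 instance.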

RandyB
  • The answer you linked is using Python and boto3. Can you elaborate on what you need different from that answer? – jordanm Nov 23 '21 at 16:17
  • If you follow the first link in that post, it describes the use of Airflow as part of the workflow, and the first step is to stand up an EC2 instance. The client is a bank; I can't even install Firefox on the virtual desktop they provision. – RandyB Nov 24 '21 at 14:52

0 Answers