I'm running a PySpark script stored on the master node of an AWS EMR cluster (1 master and 2 slaves, each with 8 GB RAM and 4 cores) with the command:
spark-submit --master yarn --deploy-mode cluster --jars /home/hadoop/mysql-connector-java-5.1.45/mysql-connector-java-5.1.45-bin.jar --driver-class-path /home/hadoop/mysql-connector-java-5.1.45/mysql-connector-java-5.1.45.jar --conf spark.executor.extraClassPath=/home/hadoop/mysql-connector-java-5.1.45/mysql-connector-java-5.1.45.jar --driver-memory 2g --executor-cores 3 --num-executors 3 --executor-memory 5g mysql_spark.py
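For context, mysql_spark.py uses the MySQL connector to talk to a MySQL database over JDBC. The actual script isn't reproduced here; the following is only a minimal, hypothetical sketch of that kind of JDBC read (host, database, table and credentials are placeholders, not the real values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql_spark").getOrCreate()

# Read a table from MySQL over JDBC (placeholder connection details)
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://<mysql-host>:3306/<database>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()

df.show()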
There are two things I noticed:
- I SSHed into the slave nodes and noticed that one of them is not being used at all (checked with htop); it stayed that way throughout the run. Is there something wrong with my spark-submit command? (Screenshot of the two slave nodes attached.)
- Before the application was submitted, 6.54 GB of the master node's 8 GB of RAM was already in use (again checked with htop). There are no other applications running. Why is this happening?