Using the EMR 6.x series, how does one ensure that master tasks run on core nodes? From reading this page, it looks like all it takes is two parameters:
yarn.node-labels.enabled: true
yarn.node-labels.am.default-node-label-expression: 'CORE'
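As a minimal sketch, these two settings correspond to a single yarn-site classification in the EMR configurations format. The helper name below is hypothetical, and the values are written as strings on the assumption that EMR expects string-valued properties:

```python
import json


def yarn_site_node_label_config(label="CORE"):
    """Build a yarn-site classification enabling YARN node labels.

    Hypothetical helper for illustration only; it is not part of mrjob or boto3.
    """
    return {
        "Classification": "yarn-site",
        "Properties": {
            # Assumption: property values are passed as strings.
            "yarn.node-labels.enabled": "true",
            "yarn.node-labels.am.default-node-label-expression": label,
        },
    }


if __name__ == "__main__":
    # Print the configuration list as JSON for inspection.
    print(json.dumps([yarn_site_node_label_config()], indent=2))
```
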
However, in my tests this doesn't work. I am using mrjob, and here is my setup:
test.py
import time

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import RawValueProtocol, JSONProtocol


class Test(MRJob):
    INPUT_PROTOCOL = RawValueProtocol
    INTERNAL_PROTOCOL = JSONProtocol

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]

    def mapper(self, _, fname):
        # sleep so things stay "running" until we are ready to stop it
        time.sleep(1000000000)

    def reducer(self, key, datas):
        time.sleep(10000000)


if __name__ == '__main__':
    test = Test()
    test.run()
mrjob.conf
runners:
  emr:
    region: us-east-1
    subnet: subnet-...
    ec2_key_pair: ...
    ec2_key_pair_file: ...
    ssh_tunnel: false
    check_cluster_every: 30
    cleanup: NONE
    cleanup_on_failure: NONE
    read_logs: false
    add_steps_in_batch: true
    cat_output: false
    image_version: 6.3.0
    instance_fleets:
    - InstanceFleetType: MASTER
      InstanceTypeConfigs:
      - InstanceType: m6g.xlarge
      TargetOnDemandCapacity: 1
    - InstanceFleetType: CORE
      TargetOnDemandCapacity: 1
      InstanceTypeConfigs:
      - InstanceType: m6g.xlarge
        BidPriceAsPercentageOfOnDemandPrice: 100
        WeightedCapacity: 1
    - InstanceFleetType: TASK
      TargetOnDemandCapacity: 0
      TargetSpotCapacity: 1
      InstanceTypeConfigs:
      - InstanceType: m6g.xlarge
        BidPriceAsPercentageOfOnDemandPrice: 100
        WeightedCapacity: 1
    bootstrap:
    - sudo yum groupinstall "Development Tools" -y
    - curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -o /tmp/miniconda.sh
    - bash /tmp/miniconda.sh -b -p /home/hadoop/miniconda
    - rm /tmp/miniconda.sh
    - /home/hadoop/miniconda/bin/conda update conda -y
    - /home/hadoop/miniconda/bin/conda install -c conda-forge -y -q pip
    - /home/hadoop/miniconda/bin/pip install mrjob
    bootstrap_python: false
    bootstrap_mrjob: false
    python_bin: /home/hadoop/miniconda/bin/python -W ignore
    emr_configurations:
    - Classification: hdfs-site
      Properties:
        dfs.replication: 1
    - Classification: mapred-site
      Properties:
        mapreduce.map.memory.mb: 1000
        mapreduce.reduce.memory.mb: 1000
        mapreduce.job.reduces: 5
        mapreduce.job.maps: 5
    - Classification: yarn-site
      Properties:
        yarn.node-labels.enabled: true
        yarn.node-labels.am.default-node-label-expression: 'CORE'
test.txt
1
2
3
and a command:
python test.py -r emr --conf-path mrjob.conf test.txt
I get this error:
Streaming Command Failed!
Command exiting with ret '5'
If I comment out the last four lines of mrjob.conf (the yarn-site classification), everything works. Any ideas what's going on? I would really like to constrain master tasks to core nodes.