
I have an mrjob configuration that includes loading a large file from S3 into HDFS. I would like to include these commands in the configuration file, but it seems that all bootstrap commands execute on every node in the cluster. This is overkill and might also create synchronization problems.

Is there some way to include startup commands for the master node only in the mrjob configuration, or is the only solution to SSH into the head node after the cluster is up and perform these operations by hand?

Yoav

1 Answer


Well, you could have your steps start with a mapper and set mapred.map.tasks=1 in your jobconf, so that the load runs in a single map task instead of on every node. I've never tried it, but it seems like it should work.

Another suggestion:
Use a filesystem or ZooKeeper for coordination, so that only one node does the work and the rest wait for it:

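if get_exclusive_lock_on_resource(filesystem_path_or_zookeeper_path):
    # Only the node that wins the lock does the work
    do_the_expensive_bit()
    release_lock(filesystem_path_or_zookeeper_path)

# Every other node polls until the work is finished
while expensive_bit_not_complete():
    sleep(10)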
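
Here is a minimal Python sketch of the filesystem variant of the pseudocode above. It assumes the lock and marker paths live on a filesystem that every node can see and that supports atomic exclusive file creation (a local disk or a POSIX-style shared mount; HDFS or ZooKeeper would need their own client calls). The names `try_acquire_lock`, `run_once`, `lock_path`, and `done_path` are hypothetical and are not part of mrjob or Hadoop.

```python
import os
import time


def try_acquire_lock(lock_path):
    """Return True if this process created the lock file, False if it already exists."""
    try:
        # O_CREAT | O_EXCL makes the create fail when the file is already there,
        # so at most one caller can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_once(lock_path, done_path, expensive_bit, poll_seconds=10, timeout_seconds=3600):
    """Run expensive_bit in one caller only; other callers wait for done_path.

    Returns True if this caller did the work, False if another caller did.
    """
    if os.path.exists(done_path):
        return False

    if try_acquire_lock(lock_path):
        try:
            # Re-check after winning the lock: an earlier holder may have
            # finished and released the lock in the meantime.
            if os.path.exists(done_path):
                return False
            expensive_bit()
            # Publish completion so that waiting callers can proceed.
            with open(done_path, "w") as marker:
                marker.write("done\n")
            return True
        finally:
            os.remove(lock_path)

    # Someone else holds the lock: poll until the marker appears.
    deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(done_path):
        if time.monotonic() > deadline:
            raise TimeoutError("expensive step did not complete in time")
        time.sleep(poll_seconds)
    return False
```

Each node would call something like `run_once("/shared/load.lock", "/shared/load.done", copy_file_from_s3)`, where both paths and `copy_file_from_s3` are placeholders. The separate marker file matters because the lock is released when the work ends; without it, a node that arrives late would take the free lock and repeat the load.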
Cargo23