How to make AWS Elastic MapReduce Hive commands run in parallel

Question

I have read this related question:

How to make hive run mapreduce jobs concurrently?

My question is how to set this "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup?

Also, is setting this option equivalent to doing something like the following?

cat hive_script.hql | parallel --gnu hive -e '{}'

The statements in my Hive script can run in any order: each one simply launches a job for a new (time-based) partition of an existing table in order to create the matching time-based partition of a derivative table.

If they are not equivalent in my case, will one of these strategies yield better performance?

score 2 · Accepted Answer · answered Jan 28 '14 at 06:53

My question is how to set this "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup?

Add the configuration to hive-site.xml (in my case, the file path is ./.versions/hive-0.11.0/conf/hive-site.xml):

<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>

If they are not equivalent in my case, will one of these strategies yield better performance?

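They are different. This property controls whether the independent stages within a single Hive query run in parallel, whereas the parallel pipeline starts a separate hive process for each input line. How much the property helps therefore depends on the specific Hive query.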
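
If editing hive-site.xml is inconvenient, the same properties can also be passed for a single run on the hive command line with --hiveconf. The following is only a minimal sketch: the helper name hive_parallel_args is hypothetical, the thread count of 8 is just the documented default for hive.exec.parallel.thread.number, and hive_script.hql is the script name from the question.

```shell
# Hypothetical helper: prints the --hiveconf flags that enable stage-level
# parallelism within a query. The first argument is the thread count and
# falls back to 8, Hive's default for hive.exec.parallel.thread.number.
hive_parallel_args() {
  threads="${1:-8}"
  printf '%s\n' "--hiveconf hive.exec.parallel=true --hiveconf hive.exec.parallel.thread.number=${threads}"
}

# Usage (requires hive on the PATH; the command substitution is left
# unquoted on purpose so the flags split into separate arguments):
#   hive $(hive_parallel_args 8) -f hive_script.hql
```

This only changes how stages inside each query are scheduled; it does not make the separate statements in the script run concurrently with each other.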

elprup

Is there a way to tell a newly launching EMR cluster to have this property, perhaps using the bootstrap actions? http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop – Patrick McCann Jan 28 '14 at 18:27
I found that you can specify --hive-site on startup: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html – Patrick McCann Jan 29 '14 at 01:32
