I'm working with RapidMiner to extract association rules from a big dataset. Radoop, the extension for the Hadoop ecosystem, together with its SparkRM operator, lets me run FP-Growth end to end, from retrieving the data from Hive to exploring the results. My environment:

- Windows 8.1
- Hadoop 2.6
- Spark 1.5
- Hive 2.1

I have configured spark-defaults.conf as follows:
spark.master                        yarn
spark.eventLog.enabled              true
spark.eventLog.dir                  hdfs://namenode:8021/directory
spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.driver.memory                 2G
spark.driver.cores                  1
# the overhead settings take plain megabytes in Spark 1.5, no "MB" suffix
spark.yarn.driver.memoryOverhead    384
spark.yarn.am.memory                1G
spark.yarn.am.cores                 1
spark.yarn.am.memoryOverhead        384
spark.executor.memory               1G
spark.executor.instances            1
spark.executor.cores                1
spark.yarn.executor.memoryOverhead  384
spark.executor.extraJavaOptions     -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
In yarn-site.xml I have:
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8033</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8031</value>
</property>
<property>
<name>yarn.resourcemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.resourcemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>localhost:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/E:/tweets/hadoopConf/userlog</value>
<final>true</final>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/E:/tweets/hadoopConf/temp/nm-localdir</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>3</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>
/tweets/hadoop/,
/tweets/hadoop/share/hadoop/common/*,
/tweets/hadoop/share/hadoop/common/lib/*,
/tweets/hadoop/share/hadoop/hdfs/*,
/tweets/hadoop/share/hadoop/hdfs/lib/*,
/tweets/hadoop/share/hadoop/mapreduce/*,
/tweets/hadoop/share/hadoop/mapreduce/lib/*,
/tweets/hadoop/share/hadoop/yarn/*,
/tweets/hadoop/share/hadoop/yarn/lib/*,
/C:/spark/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
</value>
</property>
</configuration>
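To check whether the problem is Radoop itself or the underlying Spark/YARN setup, I think the same settings can be exercised with a direct submission of the bundled SparkPi example (a sketch; the example jar name is my assumption from the prebuilt Spark 1.5.0 / Hadoop 2.6 package under C:\spark):

C:\spark\bin\spark-submit --master yarn-client ^
  --driver-memory 1g ^
  --executor-memory 1g ^
  --num-executors 1 ^
  --executor-cores 1 ^
  --class org.apache.spark.examples.SparkPi ^
  C:\spark\lib\spark-examples-1.5.0-hadoop2.6.0.jar 10

If this fails with the same diagnostics, the issue is in the Hadoop/Spark configuration rather than in RapidMiner.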
The quick connection test to Hadoop completes successfully, but when I run the RapidMiner process it fails with this error:
Process failed before getting into running state. this indicates that an error occurred during submitting or starting the spark job or writing the process output or the exception to the disc. Please check the logs of the spark job on the YARN Resource Manager interface for more information about the error.
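If it helps, the same logs can also be pulled from the command line once the application ID is known (assuming YARN log aggregation is enabled; otherwise, with my settings, they stay under E:\tweets\hadoopConf\userlog):

yarn logs -applicationId <application_id>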
On localhost:8088 I see the following diagnostics: [screenshot: YARN Resource Manager application diagnostics]
This is the scheduler view for the job: [screenshot: YARN scheduler]
I'm new to Hadoop and Spark, and I can't work out how to size the memory settings efficiently. What should I change?
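Concretely, is something like the following the right direction (my untested guess, assuming the driver stays in the RapidMiner client JVM in yarn-client mode, so only the AM and the executor need YARN containers)?

spark.yarn.am.memory                512M
spark.yarn.am.memoryOverhead        384
spark.executor.memory               512M
spark.yarn.executor.memoryOverhead  384

If the rounding works as I described above, each container would come out at 1024 MB, so the two together would exactly fit the 2048 MB NodeManager.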