I have set up a Hadoop cluster with 20 GB RAM and 6 cores. I have around 8 GB of data in 3 CSV files that I need to join, and I am using Apache Hive for this. Hadoop and Hive are both 3.x versions.
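For context, a minimal sketch of the kind of DDL involved, assuming plain delimited external tables over the CSV files (the column list, types, and HDFS path below are illustrative, not the exact schema):

CREATE EXTERNAL TABLE ZUniq (
  UniqID     BIGINT,
  UID        STRING,
  Num_Period STRING
  -- remaining columns omitted
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/zuniq';   -- path is illustrative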
Here is the Hive query
SELECT DISTINCT
    rm.UID, rm.Num_Period,
    rpd.C_Mon - rpd.Non_Cred_Inputs AS Claimed_Mon,
    rpd.Splr_UID, rpd.Doc_Type, rpd.Doc_No_Num, rpd.Doc_Date,
    rpd.Purchased_Type, rpd.Rate_ID, rpd.C_Withheld, rpd.Non_Creditable_Inputs,
    rsd.G_UID, rsd.G_Type,
    rsd.Doc_Type AS G_doc_type, rsd.Doc_No_Num AS G_doc_no_num, rsd.Doc_Date AS G_doc_date,
    rsd.Sale_Type AS G_sale_type, rsd.Rate_ID AS G_rate_id, rsd.Rate_Value AS G_rate_value,
    rsd.hscode AS G_hscode
FROM ZUniq rm
INNER JOIN Zpurchasedetails rpd ON rm.UniqID = rpd.UniqID
INNER JOIN Zsaledetails rsd ON rpd.UniqID = rsd.UniqID
WHERE rpd.Non_Cred_Inputs < rpd.C_Mon;
Now, there is around 300 GB of free disk on one node and 400 GB on the other. When I run the above query, all of that disk space gets used up and the job then goes to pending with a message that no healthy node exists.
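As far as I understand, that message comes from the NodeManager disk health checker: once a local dir passes the utilization threshold the disk is marked bad and the node is reported as unhealthy. I have not overridden these properties, so I assume the defaults apply (the values below are the YARN defaults as I understand them, not something set in my yarn-site.xml):

<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
  <description>A local disk is marked bad once it is more than 90% full.</description>
</property>
<property>
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
  <description>Minimum fraction of healthy local disks required for the NodeManager to launch new containers.</description>
</property>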
Here is the Hadoop configuration
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value> -->
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> -->
  </property>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/disk1/.hdfs/tmp</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hms-master</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>5184000</value>
    <description>Delete the logs after 60 days.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>3</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits are enforced for containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio of virtual memory to physical memory when setting memory limits for containers.</description>
  </property>
  <!-- Logging related options -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://hms-master:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>Total RAM that can be used by all containers on a single node.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>10240</value>
    <description>Maximum RAM that one container can get.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
    <description>Minimum RAM that one container (e.g. map or reduce) can get. It should be less than or equal to the yarn.nodemanager.resource.memory-mb value.</description>
  </property>
</configuration>
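For reference on sizing: with yarn.nodemanager.resource.memory-mb = 10240 and yarn.scheduler.minimum-allocation-mb = 1024, each NodeManager can fit at most 10240 / 1024 = 10 containers of the minimum allocation size (fewer if containers request more memory, and at most 3 concurrent containers if the scheduler also enforces the 3 configured vcores).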