I have set up a Hadoop cluster with 20 GB RAM and 6 cores. I have around 8 GB of data in 3 CSV files that I have to join, and I am using Apache Hive for this. Both Hadoop and Hive are version 3.x.

Here is the Hive query

SELECT DISTINCT
    rm.UID, rm.Num_Period,
    rpd.C_Mon - rpd.Non_Cred_Inputs AS Claimed_Mon,
    rpd.Splr_UID, rpd.Doc_Type, rpd.Doc_No_Num, rpd.Doc_Date,
    rpd.Purchased_Type, rpd.Rate_ID, rpd.C_Withheld, rpd.Non_Creditable_Inputs,
    rsd.G_UID, rsd.G_Type,
    rsd.Doc_Type AS G_doc_type, rsd.Doc_No_Num AS G_doc_no_num,
    rsd.Doc_Date AS G_doc_date, rsd.Sale_Type AS G_sale_type,
    rsd.Rate_ID AS G_rate_id, rsd.Rate_Value AS G_rate_value,
    rsd.hscode AS G_hscode
FROM ZUniq rm
INNER JOIN Zpurchasedetails rpd ON rm.UniqID = rpd.UniqID
INNER JOIN Zsaledetails rsd ON rpd.UniqID = rsd.UniqID
WHERE rpd.Non_Cred_Inputs < rpd.C_Mon;
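
For reference, the plan and the join-key distribution can be inspected roughly like this (a sketch only; table and column names are taken from the query above, and the LIMIT is arbitrary):

-- Inspect the execution plan of the join
EXPLAIN
SELECT rm.UID
FROM ZUniq rm
INNER JOIN Zpurchasedetails rpd ON rm.UniqID = rpd.UniqID
INNER JOIN Zsaledetails rsd ON rpd.UniqID = rsd.UniqID;

-- Rows per join key in each detail table; a few hot UniqID values can blow up the join output
SELECT UniqID, COUNT(*) AS cnt FROM Zpurchasedetails GROUP BY UniqID ORDER BY cnt DESC LIMIT 10;
SELECT UniqID, COUNT(*) AS cnt FROM Zsaledetails GROUP BY UniqID ORDER BY cnt DESC LIMIT 10;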

Now, there is around 300 GB of free disk space on one node and 400 GB on the other. When I run the above query, all of the disk space gets used up and the job then goes to pending with a message that no healthy node exists.
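
For reference, a minimal sketch of the intermediate-output compression settings that might reduce the shuffle spill, set per Hive session (assuming the MapReduce execution engine; I have not verified that this fixes the problem):

-- Compress the intermediate (shuffle) data the job writes to the local disks
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;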

Here is the Hadoop configuration

yarn-site.xml


<configuration>
    
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
      
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
<!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value> -->
<!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> -->
  </property>

<!-- Site specific YARN configuration properties -->
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/disk1/.hdfs/tmp</value>
</property>
<property>
      <name>yarn.resourcemanager.hostname</name>
      <value>hms-master</value>
</property>
<property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>5184000</value>
      <description>Delete the logs after 60 days </description>
</property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>3</value>
  </property>

  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>

<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>
 <property>
   <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
  </property>

<!-- Logging related option -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
<property>
         <name>yarn.log.server.url</name>
         <value>http://hms-master:19888/jobhistory/logs</value>
</property>

<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>Total RAM that can be used on a single node by all containers.</description>
</property>

<property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>10240</value>
    <description>Maximum RAM that one container can get</description>
</property>

<property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    <description>Minimum RAM that one container (e.g. map or reduce) can get. It should be less than or equal to the yarn.nodemanager.resource.memory-mb value</description>
</property>

</configuration>
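
For completeness, the per-container memory that the MapReduce tasks actually request can also be set per Hive session; a rough sketch with assumed values kept within the 10240 MB node limit above (not my exact settings):

-- Container sizes requested by map/reduce tasks (must fit within yarn.scheduler.maximum-allocation-mb)
SET mapreduce.map.memory.mb=2048;
SET mapreduce.map.java.opts=-Xmx1638m;      -- roughly 80% of the map container
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts=-Xmx3276m;   -- roughly 80% of the reduce container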