I have set up a Hadoop cluster with 20 GB RAM and 6 cores. I have around 8 GB of data in 3 CSV files that I need to join, and I am using Apache Hive for this. Hadoop and Hive are both 3.x versions.
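For context, a minimal sketch of the kind of DDL involved, assuming plain delimited external tables over the CSV files (the column list, types, and HDFS path below are illustrative, not the exact schema):

CREATE EXTERNAL TABLE ZUniq (
  UniqID     BIGINT,
  UID        STRING,
  Num_Period STRING
  -- remaining columns omitted
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/zuniq';   -- path is illustrative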
Here is the Hive query
SELECT DISTINCT
    rm.UID, rm.Num_Period,
    rpd.C_Mon - rpd.Non_Cred_Inputs AS Claimed_Mon,
    rpd.Splr_UID, rpd.Doc_Type, rpd.Doc_No_Num, rpd.Doc_Date,
    rpd.Purchased_Type, rpd.Rate_ID, rpd.C_Withheld, rpd.Non_Creditable_Inputs,
    rsd.G_UID, rsd.G_Type,
    rsd.Doc_Type AS G_doc_type, rsd.Doc_No_Num AS G_doc_no_num, rsd.Doc_Date AS G_doc_date,
    rsd.Sale_Type AS G_sale_type, rsd.Rate_ID AS G_rate_id, rsd.Rate_Value AS G_rate_value,
    rsd.hscode AS G_hscode
FROM ZUniq rm
INNER JOIN Zpurchasedetails rpd ON rm.UniqID = rpd.UniqID
INNER JOIN Zsaledetails rsd ON rpd.UniqID = rsd.UniqID
WHERE rpd.Non_Cred_Inputs < rpd.C_Mon;
Now, there is around 300 GB of free disk on one node and 400 GB on the other. When I run the above query, all of that disk space gets used up and the job then goes to pending with a message that no healthy node exists.
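As far as I understand, that message comes from the NodeManager disk health checker: once a local dir passes the utilization threshold the disk is marked bad and the node is reported as unhealthy. I have not overridden these properties, so I assume the defaults apply (the values below are the YARN defaults as I understand them, not something set in my yarn-site.xml):

<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
  <description>A local disk is marked bad once it is more than 90% full.</description>
</property>
<property>
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
  <description>Minimum fraction of healthy local disks required for the NodeManager to launch new containers.</description>
</property>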
Here is the Hadoop configuration
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value> -->
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> -->
  </property>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/disk1/.hdfs/tmp</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hms-master</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>5184000</value>
    <description>Delete the logs after 60 days.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>3</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits are enforced for containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio of virtual memory to physical memory when setting memory limits for containers.</description>
  </property>
  <!-- Logging related options -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://hms-master:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>Total RAM that can be used by all containers on a single node.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>10240</value>
    <description>Maximum RAM that one container can get.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
    <description>Minimum RAM that one container (e.g. map or reduce) can get. It should be less than or equal to the yarn.nodemanager.resource.memory-mb value.</description>
  </property>
</configuration>
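For reference on sizing: with yarn.nodemanager.resource.memory-mb = 10240 and yarn.scheduler.minimum-allocation-mb = 1024, each NodeManager can fit at most 10240 / 1024 = 10 containers of the minimum allocation size (fewer if containers request more memory, and at most 3 concurrent containers if the scheduler also enforces the 3 configured vcores).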