
So far for this issue I have tried the solutions from here (1) and here (2). However, while these solutions do result in the MapReduce job being carried out, it appears to run only on the name node, as I get output similar to here (3).

Basically, I am running a 2 node cluster with a mapreduce algorithm that I have designed myself. The mapreduce jar is executed perfectly on a single node cluster, which leads me to think that there is something wrong with my hadoop multi-node configuration. To set up multi-node, I followed the tutorial here.

To report what is going wrong: when I execute my program (after checking that the NameNode, TaskTrackers, JobTracker, and DataNodes are running on their respective nodes), it halts with this line in the terminal:

INFO mapred.JobClient: map 100% reduce 0%

If I take a look at the logs for the task I see copy failed: attempt... from slave-node followed by a SocketTimeoutException.

Taking a look at the logs on my slave-node (DataNode) shows that the execution halts at the following line:

TaskTracker: attempt... 0.0% reduce > copy >

As the solutions in links 1 and 2 suggest, removing various IP addresses from the etc/hosts file results in successful execution. However, I then end up with entries such as those in link 4 in my slave-node (DataNode) log, for example:

INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381

WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.

This looks suspect to me as a new Hadoop user, but it may be perfectly normal. To me it looks as though something was pointing to the incorrect IP address in the hosts file, and by removing that IP address I am simply halting execution on the slave node, so processing continues on the name node instead (which isn't really advantageous at all).
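As a sanity check on that suspicion (just a rough sketch, not part of the tutorial; the master and slave hostnames are the ones from the hosts files below), name resolution can be verified on each node:

getent hosts master slave             # expect 192.168.1.87 and 192.168.1.74
getent hosts $(hostname)              # the local hostname should resolve (typically to 127.0.1.1)
ping -c 1 master && ping -c 1 slave   # both nodes should be reachable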

To sum up:

  1. Is this output expected?
  2. Is there a way I can see what was executed on what node post-execution? (a rough approach is sketched after this list)
  3. Can anybody spot anything that I may have done wrong?
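
For question 2, the closest thing I have found (just a rough sketch, assuming the default Hadoop 1.x log layout under $HADOOP_HOME/logs) is to grep the TaskTracker logs and per-attempt log directories on each node, or to use the JobTracker web UI:

JOB=job_201201301055_0381                                  # example job id from the logs above
grep "$JOB" $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log   # run on each node to see which attempts it handled
ls -R $HADOOP_HOME/logs/userlogs | grep attempt            # per-attempt log directories on this node
# The JobTracker web UI (http://master:50030) also lists the machine for each attempt.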

EDIT: added hosts and config files for each node

Master: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Dell-System-XPS-L702X

#The following lines are for hadoop master/slave setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Slave: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Home # this line was incorrect, it was set as 7.0.1.1

#the following lines are for hadoop multi-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Master: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Slave: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>

</configuration>

Master: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Slave: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Master: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>
</configuration>

Slave: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>

</configuration>
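
Given these settings (a minimal check, assuming the ports above and a Hadoop 1.x install), connectivity from the slave to the master and daemon registration can be confirmed with:

nc -zv master 54310                   # fs.default.name (NameNode RPC)
nc -zv master 54311                   # mapred.job.tracker (JobTracker RPC)
hadoop dfsadmin -report               # should list two live datanodes
hadoop job -list-active-trackers      # should list a tracker on both master and slave
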
planty182
  • I hope you have disabled firewall in all the machines. – SSaikia_JtheRocker Sep 05 '13 at 18:17
  • ah, forgot to mention that I'm running ubuntu, and as far as I understand, firewalls are disabled by default... see [here](https://help.ubuntu.com/lts/serverguide/firewall.html). I can copy data to the slave node via ssh, so firewall shouldn't be the issue? – planty182 Sep 05 '13 at 18:41
  • Please show you /etc/hosts and hadoop config files. I'm sure someone would help you. Thanks. – SSaikia_JtheRocker Sep 06 '13 at 04:15
  • posted the info you asked for, but found the issue (I'm pretty sure) in the process... the slave hosts file was corrupt, setting its own address as 7.0.1.1 instead of 127.0.1.1; changing this has fixed the issue. I posted all the info in case somebody else makes the same mistake... – planty182 Sep 06 '13 at 12:06
  • glad that you fixed it! – SSaikia_JtheRocker Sep 06 '13 at 15:57
  • this could also be issue, take care of this also http://stackoverflow.com/questions/32511280/hadoop-1-2-1-multinode-cluster-reducer-phase-hangs-for-wordcount-program/32551259#32551259 – Bruce_Wayne Sep 13 '15 at 20:16

3 Answers


The error is in etc/hosts:

During the erroneous runs, the slave etc/hosts file looked like this:

127.0.0.1       localhost
7.0.1.1       joseph-Home # THIS LINE IS INCORRECT, IT SHOULD BE 127.0.1.1

#the following lines are for hadoop multi-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

As you may have spotted, the IP address of this computer, 'joseph-Home', was incorrectly configured: it was set to 7.0.1.1 when it should have been 127.0.1.1. Changing line 2 of the slave etc/hosts file to 127.0.1.1 joseph-Home fixed the issue, and my logs now appear normally on the slave node.
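
A quick way to catch this kind of typo in future (just a sketch) is to check that the machine's own hostname resolves:

hostname                   # joseph-Home
getent hosts $(hostname)   # should print 127.0.1.1 (or a real LAN address), not 7.0.1.1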

New etc/hosts file:

127.0.0.1       localhost
127.0.1.1       joseph-Home # this line is now correct

#the following lines are for hadoop multi-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
planty182

A tested solution is to add the property below to hadoop-env.sh and restart all Hadoop cluster services.

hadoop-env.sh

export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"
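
For example (a sketch, assuming a Hadoop 1.x layout with HADOOP_HOME set and conf/hadoop-env.sh as the configuration file):

echo 'export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"' >> $HADOOP_HOME/conf/hadoop-env.sh
$HADOOP_HOME/bin/stop-all.sh     # stop the HDFS and MapReduce daemons across the cluster
$HADOOP_HOME/bin/start-all.sh    # start them again so the new setting takes effect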

user3886907

I also met this problem today. In my case the disk of one node in the cluster was full, so Hadoop could not write its log files to the local disk. A possible solution is to delete some unused files from that node's disk. Hope it helps.
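
To find the offending node and files, something like this can help (a sketch; the hadoop.tmp.dir and log paths are taken from the configs above, adjust to your setup):

df -h                                                     # spot any filesystem at or near 100%
du -sh /home/hduser/tmp/* 2>/dev/null | sort -h | tail    # hadoop.tmp.dir from core-site.xml
du -sh $HADOOP_HOME/logs/* 2>/dev/null | sort -h | tail   # old logs are a common culprit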

Kehe CAI