You need to understand the difference between the logical (virtual) view of a file in HDFS and its actual physical storage.
HDFS (Hadoop Distributed File System) only specifies how data is laid out across DataNodes. When you store a file in HDFS, it is presented as a single logical HDFS file, but the bytes are physically stored as blocks on the local disks of the DataNodes.
Let's see in detail how it works:
HDFS is a block-structured file system: it breaks each file into blocks of a fixed size (64 MB by default). These blocks are stored across a cluster of machines composed of one NameNode and several DataNodes.
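To make the block size concrete, here is a minimal Java sketch using the standard Hadoop `FileSystem` API (the path `/user/test/sample.txt` is just a made-up example) that asks the cluster which block size it would use for a file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, for illustration only
        Path p = new Path("/user/test/sample.txt");

        // Block size HDFS would use when writing this file
        long blockSize = fs.getDefaultBlockSize(p);
        System.out.println("Default block size: " + blockSize + " bytes");

        fs.close();
    }
}
```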
The NameNode maintains the file system metadata (e.g., the names of files and directories, and which blocks make up each file) and regulates client access to files;
it also handles operations such as open, close, and rename. To open a file, a client contacts the NameNode and retrieves the list of locations of the blocks that make up the file. These locations identify the DataNodes that hold each block. The client then reads the file data directly from those DataNode servers, possibly in parallel. The NameNode is not involved in this bulk data transfer, which keeps its overhead to a minimum.
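To illustrate the read path, here is a small Java sketch (the file path is hypothetical) that asks the NameNode for the block locations of a file and prints which DataNodes host each block; the actual data would then be streamed directly from those DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file, for illustration only
        Path file = new Path("/user/test/bigfile.dat");
        FileStatus status = fs.getFileStatus(file);

        // Metadata lookup answered by the NameNode; no file data is transferred here
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        fs.close();
    }
}
```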
- DataNodes are responsible for serving read/write requests from clients and for block creation, deletion, and replication. So every block in HDFS is actually stored on the local disk of some DataNode; a minimal write sketch is shown below.
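For completeness, a minimal write sketch under the same assumptions (made-up path, default configuration): the client just writes a byte stream to what looks like a single file, and HDFS transparently splits it into blocks and replicates them across DataNodes behind the scenes:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical destination path
        Path out = new Path("/user/test/hello.txt");

        // The client sees a single logical file; block placement and
        // replication on the DataNodes happen behind this stream.
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        fs.close();
    }
}
```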