
I am trying to understand where Hadoop stores data in HDFS. I am referring to the config files core-site.xml and hdfs-site.xml.

The properties that I have set are:

  • In core-site.xml:

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop/tmp</value>
    </property>
    
  • In hdfs-site.xml:

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/hadoop/hdfs/namenode</value>
    </property>
    
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/hadoop/hdfs/datanode</value>
    </property>
    

With the above arrangement, the data blocks should be stored in the directory specified by dfs.datanode.data.dir. Is this correct?

I referred to the Apache Hadoop documentation, and from it I see the following:

  • core-default.xml: hadoop.tmp.dir --> A base for other temporary directories.

  • hdfs-default.xml: dfs.datanode.data.dir --> Determines where on the local filesystem a DFS data node should store its blocks.

    The default value for this property is file://${hadoop.tmp.dir}/dfs/data.

Since I explicitly provided the value for dfs.datanode.data.dir (in hdfs-site.xml), does that mean the data will be stored in that location? If so, would dfs/data be appended to ${dfs.datanode.data.dir}, i.e. would it become /hadoop/hdfs/datanode/dfs/data?

However, I didn't see this directory structure being created.

One observation from my environment:

After I run some MapReduce programs, I see the directory /hadoop/tmp/dfs/data being created.

So I am not sure whether data actually gets stored in the directory specified by dfs.datanode.data.dir.
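For what it's worth, one way to confirm which values the daemons actually resolve (a quick sketch, assuming a Hadoop 2.x installation where the hdfs getconf command is available):

    # Print the effective value of each property after *-site.xml overrides
    hdfs getconf -confKey dfs.datanode.data.dir
    hdfs getconf -confKey dfs.namenode.name.dir
    hdfs getconf -confKey hadoop.tmp.dir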

Does anyone have a similar experience?

ROMANIA_engineer
CuriousMind

1 Answer


The data for HDFS files will be stored in the directory specified by dfs.datanode.data.dir; the /dfs/data suffix that you see in the default value will not be appended.
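For example, on a Hadoop 2.x DataNode the configured directory itself typically ends up looking something like this (a rough sketch; the block pool ID and subdirectory names will differ on your cluster):

    # Rough sketch of a Hadoop 2.x DataNode storage directory
    $ ls /hadoop/hdfs/datanode
    current  in_use.lock
    $ ls /hadoop/hdfs/datanode/current
    BP-1234567890-127.0.0.1-1395000000000  VERSION
    # actual block files live under .../current/BP-*/current/finalized/...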

If you edit hdfs-site.xml, you'll have to restart the DataNode service for the change to take effect. Also remember that changing the value will eliminate the ability of the DataNode service to supply blocks that were stored in the previous location.
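As a sketch, with a typical Hadoop 2.x tarball install (paths assume HADOOP_HOME is set; Hadoop 3.x uses hdfs --daemon stop/start datanode instead), the restart looks roughly like:

    # Restart the DataNode so it picks up the new dfs.datanode.data.dir
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode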

Lastly, above you have your values specified with file:/... instead of file://.... File URIs do need that extra slash, so that might be causing these values to revert to the defaults.

AdrieanKhisbe
RickH
  • Thanks for your response. When I made the change (adding the extra '/'), I got an exception at the time of formatting the file system: 14/03/22 00:18:27 FATAL namenode.NameNode: Exception in namenode join at java.io.File.(File.java:423) at org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageDirectory(NNStorage.java:324). Please let me know if I am missing something else. Thanks, Vipin – CuriousMind Mar 21 '14 at 18:59
  • I believe we have to use file:/// and not file://. – CuriousMind Mar 21 '14 at 19:06
  • You're right, you need 3 slashes for a file: URI. Didn't realize that the [format](http://tools.ietf.org/html/rfc1738) has room for a host name, i.e. `file:///` (and the 'host' part is optional, the 3 slashes aren't). – RickH Mar 21 '14 at 19:19
  • The reason I saw /hadoop/tmp/dfs is that dfs.namenode.checkpoint.dir (the secondary NameNode directory) uses file://${hadoop.tmp.dir}/dfs/namesecondary by default, and I didn't change it. – CuriousMind Mar 21 '14 at 19:51
  • How would one go about changing the location of this directory while retaining the ability of the DataNode to supply the blocks stored at the original location? I'd imagine just copying the contents of the original folder to the new folder would not suffice, since HDFS would not know of this change in its index. – jimijazz Sep 29 '15 at 14:00
  • Found this to answer my own question (see the sketch below): [link](http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F) – jimijazz Sep 30 '15 at 23:26
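For reference, the procedure described in that FAQ boils down to something like the following sketch (assumed paths and Hadoop 2.x scripts; the exact block and metadata subdirectories to move depend on your layout):

    # Sketch: relocating a DataNode storage directory (per the FAQ linked above)
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode      # 1. stop the DataNode
    rsync -a /hadoop/tmp/dfs/data/ /hadoop/hdfs/datanode/ # 2. copy block and meta files
    # 3. update dfs.datanode.data.dir in hdfs-site.xml to point at the new path
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode     # 4. start the DataNode again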