31

I just downloaded the Hortonworks sandbox VM; it runs Hadoop version 2.7.1. I add some files using the

hadoop fs -put /hw1/* /hw1

...command. After that I delete the added files with the

hadoop fs -rm /hw1/*

...command, and then empty the trash with the

hadoop fs -expunge

...command. But the DFS Remaining space does not change after the trash is emptied, even though I can see that the data was indeed removed from /hw1/ and from the trash. I have the fs.trash.interval parameter set to 1.

In fact I can still find all my data, split into chunks, in the /hadoop/hdfs/data/current/BP-2048114545-10.0.2.15-1445949559569/current/finalized/subdir0/subdir2 folder, which really surprises me, because I expected it to be deleted.

So my question is: how do I delete the data so that it is really deleted? After a few rounds of adding and deleting I ran out of free space.

Giorgos Myrianthous
  • 36,235
  • 20
  • 134
  • 156
serg
  • 1,003
  • 3
  • 16
  • 26
  • It means the `namenode` deleted the metadata but the `datanode` didn't delete the data. Check your `namenode` and `datanode` logs for errors or warnings. Try running `hdfs dfsadmin -report` and see if you get any useful information. – alvits Dec 07 '15 at 18:50
  • Also it will take some time to perform the bookkeeping. – Durga Viswanath Gadiraju Dec 08 '15 at 03:05
  • Hadoop moves the content to the trash directory on the -rm command. If you want to delete folders permanently then you have to use the command `hadoop fs -rm -skipTrash /hw1/*` – Shivanand Pawar Dec 08 '15 at 05:31
  • @ShivanandPawar that's not exactly true, because files in the /trash directory are deleted after the number of minutes specified in the `fs.trash.interval` property. Furthermore, the asker used `hadoop fs -expunge`, which permanently deletes files from the trash. – maxteneff Dec 08 '15 at 09:41
  • @maxteneff My bad. Thanks a lot for pointing that out. – Shivanand Pawar Dec 08 '15 at 09:59
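Following the diagnostics suggested in the comments, one way to check whether the datanodes have actually released the blocks is to compare the reported space before and after the delete. A rough sketch (it assumes a running HDFS cluster; the /hw1 path is the one from the question):

```shell
# Show cluster-wide and per-datanode space figures
hdfs dfsadmin -report | grep -E 'DFS Remaining|DFS Used'

# Delete recursively, bypassing the trash so no expunge step is needed
hadoop fs -rm -r -skipTrash /hw1/*

# Check again; the figures may take a while to update,
# since block deletion happens asynchronously on the datanodes
hdfs dfsadmin -report | grep -E 'DFS Remaining|DFS Used'
```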

6 Answers

20

You can use

hdfs dfs -rm -R /path/to/HDFS/file

since `hadoop dfs` has been deprecated.

Giorgos Myrianthous
  • 36,235
  • 20
  • 134
  • 156
17

Try hadoop fs -rm -R URI

The -R option deletes the directory and any content under it recursively.

BruceWayne
  • 3,286
  • 4
  • 25
  • 35
14

Your problem comes from the way HDFS works. In HDFS (as in many other file systems) the physical deletion of files is not the fastest operation. Because HDFS is a distributed file system that usually keeps at least 3 replicas of a file on different servers, each replica (which may consist of many blocks on different hard drives) must be deleted in the background after you request the file's deletion.

The official Hadoop documentation says the following:

The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
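To observe the delay the documentation describes, one option is to poll the reported free space after the delete. A minimal sketch (the 30 iterations and 60-second interval are arbitrary choices, and a running cluster is assumed):

```shell
# Watch "DFS Remaining" until the background block deletion catches up
for i in $(seq 1 30); do
    hdfs dfsadmin -report | grep 'DFS Remaining' | head -1
    sleep 60   # blocks are freed asynchronously on the datanodes
done
```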

maxteneff
  • 1,523
  • 12
  • 28
5

What works for me:

hadoop fs -rm -r <your directory>
Flowra
  • 1,350
  • 2
  • 16
  • 19
5

If you also need to skip the trash, the following command works for me:

hdfs dfs -rm -R -skipTrash /path/to/HDFS/file
Karol
  • 51
  • 1
  • 2
1

Durga Viswanath Gadiraju is right: it is a question of time. Maybe my PC is slow (it also runs in a VM), but after about 10 minutes the files were physically deleted when using the sequence from my question. Note: set the fs.trash.interval parameter to 1, or by default files won't be deleted sooner than 6 hours.

serg
  • 1,003
  • 3
  • 16
  • 26