
I have a requirement to delete a folder on HDFS containing a large number of files, say 1,000,000. This is not a one-time task; it is a daily requirement. Currently I am using the code below:

    Configuration c = new Configuration();
    FileSystem fs = FileSystem.get(c);
    fs.delete(folder, true);

But the above takes a very long time, approximately 3 hours. Is there any way to delete the entire folder faster?

asked by agarwal_achhnera, edited by GregP
  • Worth a try: https://stackoverflow.com/questions/34140344/how-to-delete-files-from-the-hdfs (disabling the trash) – RC Jul 12 '17 at 11:30
  • @RC. It is worth using `-skipTrash` when you are sure that the data is to be deleted permanently. Yet, the impact will most likely be marginal (if any). The option is basically useful mostly for over-quota directories. The trashing operation is implemented as a simple metadata operation, which completes fast regardless of the number of files in the directory, or the size of each. – Pierre Jul 12 '17 at 11:35
  • @RC Trash is already disabled with zero interval – agarwal_achhnera Jul 12 '17 at 11:40
  • I'm curious: would it help to move the folder (to some .tmp_garbage_dH65m8rq like name) before deleting it? You'd at least be able to create new files with the same name. – Caesar Nov 07 '19 at 01:53
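
A minimal sketch of the rename-before-delete idea from the last comment above, assuming the same Hadoop `FileSystem` API used in the question; the paths and the temporary name are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.IOException;

    public class RenameThenDelete {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths, for illustration only.
            Path folder  = new Path("/data/daily_output");
            Path garbage = new Path("/data/.tmp_garbage_daily_output");

            // Renaming is a cheap namenode metadata operation, so the original
            // name becomes available again right away ...
            if (fs.rename(folder, garbage)) {
                // ... while the recursive delete still takes as long as before.
                fs.delete(garbage, true);
            }
        }
    }

This does not make the delete itself any faster; it only frees up the original path so new files can be written under the same name while the old data is being removed.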

1 Answer


Simple answer: you can't.

Let me explain why. When you are deleting a folder, you are removing all references to all files (recursively) contained in it. The metadata about these files (chunk locations) is retained in the namenode.

The data nodes store data chunks, but have basically no idea about the actual files they correspond to. Although you could technically remove all references to a folder from the namenode (which would make the folder appear as deleted), the data would still remain on the datanodes, which would have no way of knowing that the data is "dead".

As such, when you delete a folder, you first have to reclaim the storage of all data chunks for all files, spread across the whole cluster. This can take a significant amount of time, but is basically unavoidable.

You could simply process deletions in a background thread. Although this won't shorten the lengthy operation, it would at least hide it from the application.
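
A minimal sketch of that idea, again assuming the Hadoop `FileSystem` API from the question; the executor setup and the path are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BackgroundDelete {
        // Single worker thread dedicated to slow deletions.
        private static final ExecutorService DELETER = Executors.newSingleThreadExecutor();

        public static void deleteAsync(Path folder) {
            DELETER.submit(() -> {
                try {
                    FileSystem fs = FileSystem.get(new Configuration());
                    fs.delete(folder, true); // still slow, but no longer blocks the caller
                } catch (Exception e) {
                    e.printStackTrace();     // real code should log and possibly retry
                }
            });
        }

        public static void main(String[] args) {
            deleteAsync(new Path("/data/daily_output")); // hypothetical path
            // ... the application can continue with other work here ...
            DELETER.shutdown(); // lets the pending deletion finish, then exits
        }
    }

The total time spent deleting is unchanged; the application simply stops waiting on it.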

answered by Pierre
  • If I use some map-reduce program to delete these files, maybe 500 files per mapper, will it help? Or will it be the same, since the driver program will take time to load the files? – agarwal_achhnera Jul 12 '17 at 11:38
  • @agarwal_achhnera Why would you use MR to delete files? The point of it is to process the files, usually not to perform maintenance operations. – Pierre Jul 12 '17 at 11:44
  • Because let's suppose there are 1,000,000 files; now I create one mapper per 500 files, so in total 2,000 mappers will simultaneously send delete requests for their 500 files. I'm not sure whether the namenode will work that fast, but I'm just asking whether it would help or not. – agarwal_achhnera Jul 12 '17 at 12:07
  • Any MR job actually needs to hit the namenode to recover the location of all data chunks to process in the first place. It does not happen magically. Once again, if you are using MR for maintenance operations, you can usually assume that you are doing something very broken. You can trust HDFS developers to have implemented the `rmdir` command the fastest way that works with the architecture. – Pierre Jul 12 '17 at 12:15
  • Rather than write your own, you should consider running `hadoop distcp -m NUM_MAPPERS -delete /EMPTYDIR /YOUR_DELETE_DIR` and see if you get any speedup. I think you should be able to speed things up, but I don't know if the gains will be significant because at some point you are just bottlenecked at the namenode. http://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html – tk421 Jul 13 '17 at 20:55