
I made a mistake and added a few hundred part files to a table partitioned by date. I can see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.

For a single day, I may have three files like the ones below, and I want to remove only the new file. I can tell it's new from the update timestamp when I use `hadoop fs -ls`:

/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801

I have many dates, so I'll have to do this for event_date2, event_date3, etc., always removing the 'newfile_20191114' from each date.

The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.

I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.

phenderbender
  • I could be wrong, but I'm afraid there isn't a one-liner for this. It should only be a few lines of Bash, though. – Ben Watson Nov 15 '19 at 13:37
  • Do all of your new files have the same timestamp? Or is the timestamp of the new files greater than some specified time? – Strick Nov 15 '19 at 14:06
  • @Strick yes, exactly - I've been able to make some progress using: `hdfs dfs -ls /tmp | sort -k6,7` from this post: https://stackoverflow.com/questions/37022749/is-there-a-hdfs-command-to-list-files-in-hdfs-directory-as-per-timestamp So I now have my list of specific files I need to remove, and I'm now trying to find a way to bulk remove a pre-created list of files – phenderbender Nov 15 '19 at 14:11
  • I will post an answer. – Strick Nov 15 '19 at 14:12
  • I have posted an answer; please check if it solves your purpose. – Strick Nov 15 '19 at 14:26

1 Answer


As mentioned in your comment, you already have the list of files that need to be deleted. Create a simple script and redirect the output to a temp file,

like this:

hdfs dfs -ls /tmp | sort -k6,7 > files.txt

Please note that `sort -k6,7` will list all the files, but sorted by timestamp. I am sure you don't want to delete all of them, so you can select just the top n files that need to be deleted, let's say 100.
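For reference, each line of `hdfs dfs -ls` output has the layout shown below (the owner, group, and size here are made-up values), which is why fields 6 and 7 are the modification date and time, and field 8 is the full path used by the `awk` step later:

-rw-r--r--   3 hive hadoop    1048576 2019-11-14 21:03 /this/is/my_directory/event_date1_newfile_20191114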

Then you can update your command to:

hdfs dfs -ls /tmp | sort -k6,7 | head -100 |  awk '{print $8}' > files.txt

Or, if you know the specific timestamp of your new files, then you can try the command below:

hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" |  awk '{print $8}' > files.txt
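As an aside (not part of the original answer), if the new files span two adjacent timestamps rather than one exact value, a bracket expression in the grep pattern can catch both in a single pass; a sketch using example timestamps:

hdfs dfs -ls /tmp | sort -k6,7 | grep "2019-11-14 21:0[34]" | awk '{print $8}' > files.txt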

Then read that file and delete the files one by one:

while read file; do
  hdfs dfs -rm "$file"
  echo "Deleted $file" >> deleted_files.txt # track which files have been deleted
done < files.txt
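If you prefer a single call instead of a loop, `xargs` can hand the whole list to one `hdfs dfs -rm` invocation; a sketch, assuming the paths in files.txt contain no spaces:

xargs hdfs dfs -rm < files.txt

Also note that when HDFS trash is enabled, `hdfs dfs -rm` moves files into the user's .Trash directory rather than deleting them immediately, which gives you a window to recover from mistakes.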

So your complete script can look like this:

#!/bin/bash

hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt

while read file; do
  hdfs dfs -rm "$file"
  echo "Deleted $file" >> deleted_files.txt # track which files have been deleted
done < files.txt
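Before running the script for real, it may be worth sanity-checking files.txt first (an extra precaution, not part of the original answer):

wc -l files.txt    # should match the number of files you expect to delete
head files.txt     # spot-check that only the new files are listed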
Strick
  • Looks like your solution using `head -100` is getting me even closer. I know exactly how many files I need to delete (the most recent 181 files), and with the way it sorted, I'm actually having to use `tail -181`. So if I know the exact number of files, I don't think I need to grep for the timestamp, is that correct? For my purposes, does the following look appropriate? – phenderbender Nov 15 '19 at 14:38
  • `#!/bin/bash hadoop fs -ls /my_directory/* | sort -k6,7 | tail -181 > files.txt while read file; do hadoop fs -rm $file echo "Deleted $file" >> deleted_files.txt #this is to track which files have been deleted done` – phenderbender Nov 15 '19 at 14:38
  • Yes, if you know the exact files then there's no need to grep. But to be on the safe side, just check that the output of `hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | wc -l` is also 180 or so; that is just for rechecking. Otherwise you can proceed with your tail approach. – Strick Nov 15 '19 at 14:41
  • Please note that I am also using `awk '{print $8}'` to get the exact file path instead of the other columns; you need to validate this as well, so update it in your command too. – Strick Nov 15 '19 at 14:43
  • I am unfamiliar with the formatting for grep, but my timestamps are either `2019-11-14 21:03` or `2019-11-14 21:04`; all 181 files were added within this two-second range. – phenderbender Nov 15 '19 at 14:44
  • Run `hdfs dfs -ls /tmp | sort -k6,7 | grep "2019-11-14 21:03" | wc -l` and similarly `hdfs dfs -ls /tmp | sort -k6,7 | grep "2019-11-14 21:04" | wc -l`, and check if the sum of both outputs comes to 181. – Strick Nov 15 '19 at 14:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202418/discussion-between-strick-and-phenderbender). – Strick Nov 15 '19 at 14:48