
I need to delete over 100 million small (8 KB to 300 KB) files from an SMB share using Ubuntu Linux 18.04.

These files are in the following folder structure:

/year/month/day/device/filename_[1...10000].jpeg

Things I've tried:

rm -rf * - Obviously fails because the shell expands * into far too many arguments ("Argument list too long").

find . -type f -exec rm -f {} \; - This would work if I had all the time in the world, but it is way too slow (it spawns one rm process per file).

The only effective approach I can think of is to run multiple parallel jobs, each deleting a subset of the data (until I saturate the NAS's ability to cope).

I'm not sure, however, how to run parallel commands from the Linux command line. I could use tmux and spawn many sessions, but that feels inelegant.

I guess I could also just put an & at the end of a bunch of command lines.
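Roughly what I mean, with the year folders as placeholders:

find 2018/ -type f -delete &
find 2019/ -type f -delete &
wait  # block until both background jobs finish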

Is there a way I could write this as a shell script that would spawn multiple find & erase jobs/processes?

Gavin Hill
  • Depending on how many directories there are, one subprocess per directory might be a sensible choice. It should also be noted that this sort of operation is typically *much* more efficient when run locally rather than over SMB, so if there is any way of doing it from the server end, you should prefer that option. – Harry Johnston Dec 19 '19 at 08:42

3 Answers

1

I'd consider either rm -r folder or find folder -delete. Definitely not find -exec ... \;, as that invokes a separate rm for EVERY file!

You can also use a character class wildcard to remove files, such as rm -rf a[a-h]* - or you could just remove your month or year folders one at a time. If you start several remove processes at once there’s a point at which more will actually make things slower.
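If you do go parallel, here is a sketch that caps the number of simultaneous jobs (this assumes GNU xargs, which Ubuntu 18.04 ships):

# at most 4 find -delete jobs at a time, one per top-level folder
printf '%s\0' */ | xargs -0 -P 4 -I{} find {} -delete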

I always like to avoid rm * just in case I'm in the wrong directory/folder. Also, find . -name 'xx*' won't run into the shell's argument-list limit, since the pattern is expanded by find rather than the shell.

When you're done, remove and recreate the directory (even if you have to move the remaining files out first), as that shrinks the size of the directory entry itself.

In terms of speed, either rm -r or find -delete will be fast if all the files are in a flat folder.

Another classic is find . -print | xargs rm -f - but this is really only useful when you use additional find modifiers.
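For instance, restricting it to the jpeg files and using null-delimited names so unusual filenames don't break the pipe:

# only regular .jpeg files, NUL-delimited for safety
find . -type f -name '*.jpeg' -print0 | xargs -0 rm -f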

Brian C
  • rm -r runs out of argument buffer space and dies; we opted for find folder -delete – Gavin Hill Dec 19 '19 at 14:59
  • It's also worth being aware that large directories slow down file operations - to open a file, the whole directory may need to be read. The traditional rule of thumb was around 6,000 entries in a directory, though with a modern file system I'm sure you could go higher (and the research should be easy to google). Even at those limits the shell wildcard (*) expansion can have problems! – Brian C Dec 19 '19 at 22:40
1

We ended up going with a script that spawned parallel find/delete tasks:

#!/bin/bash

# Spin up find/delete tasks
deleteTask(){
    echo "Contents of $2 - $1 is being deleted";
    time find "$1" -delete;
    echo "Contents of $2 - $1 were deleted";
}

# Get the next layer down
deleteFromSubFolder(){
    for folder in "$1"*; do
        echo "$1 - $folder"
        deleteTask "$folder" "$1" &
    done
}

# Start in the top layer
for folder in */; do
    deleteFromSubFolder "$folder" &
done
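As written, the script returns to the prompt almost immediately while the find jobs carry on in the background. If you would rather have it block until everything is gone, a wait in deleteFromSubFolder and one after the top-level loop should do it (same functions as above, just with the two waits added):

deleteFromSubFolder(){
    for folder in "$1"*; do
        echo "$1 - $folder"
        deleteTask "$folder" "$1" &
    done
    wait  # wait for this sub-folder's deleteTask jobs
}

for folder in */; do
    deleteFromSubFolder "$folder" &
done
wait  # wait for all the deleteFromSubFolder jobs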
Gavin Hill
0

You may try removing whole folders instead of individual files. Do a dry run first: find /year -type d -exec ls -ld {} \;

Replace ls -ld {} with rm -rf {} once you are sure of the result.
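For instance (the depth numbers assume the /year/month/day/device layout from the question, so each device folder is removed in one go; adjust them to your tree):

# dry run: list only the device-level folders
find /year -mindepth 3 -maxdepth 3 -type d -exec ls -ld {} +

# once you are happy with the list, delete them
find /year -mindepth 3 -maxdepth 3 -type d -exec rm -rf {} +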

Chaoxiang N