
Scenario:

With the Locky virus on the rampage, the computer center I work for has found that the only method of file recovery is using tools like Recuva. The problem with that is it dumps all the recovered files into a single directory. I would like to move those files into categories based on their file extensions: all JPGs in one folder, all BMPs in another, and so on. I have looked around Stack Overflow and, based on various other questions and responses, I managed to build a small bash script (sample provided) that kind of does that; however, it takes forever to finish and I think I have the extensions messed up.

Code:

#!/bin/bash
path=$2   # Starting path to the directory of the junk files
var=0     # How many records were processed
SECONDS=0 # reset the clock so we can time the event

clear

echo "Searching $2 for file types and then moving all files into grouped folders."

# Only want to move files from the first level, as directories are OK where they are
for FILE in `find $2 -maxdepth 1 -type f`
do
  # Split the EXT off for the directory name using AWK
  DIR=$(awk -F. '{print $NF}' <<<"$FILE")
  # DEBUG ONLY
  # echo "Moving file: $FILE into directory $DIR"
  # Make a directory in our path then Move that file into the directory
  mkdir -p "$DIR"
  mv "$FILE" "$DIR"
  ((var++))
done

echo "$var Files found and orginized in:"
echo "$(($diff / 3600)) hours, $((($diff / 60) % 60)) minutes and $(($diff % 60)) seconds."

Question:

How can I make this more efficient when dealing with 500,000+ files? The find takes forever to grab a list of files, and inside the loop the script attempts to create a directory even if that path already exists. I would like to deal with those two particular aspects of the loop more efficiently, if at all possible.

  • I think your question is "How can I make this faster?" and focusing on the `find` and the `mkdir` are your theories based on what you think you know about `mkdir` and what you've seen interactively watching the script execute. If you want to make it faster you should measure how fast these portions are in order to identify the true bottleneck(s). – Brian Cain Apr 01 '16 at 15:53
  • Unless you *know* that all the files to move have nice file names with no whitespace or characters with special meaning to the shell, your `for` loop is broken. – chepner Apr 01 '16 at 16:25
  • Running half a million `awk` processes isn't ideal. Use bash parameter substitution to get the extension (see the sketch after these comments). – Mark Setchell Apr 01 '16 at 16:30
  • @chepner I had a feeling that was the case, as I had a ton of "cannot stat" errors and things like "cannot find". – Drew Malone Apr 01 '16 at 16:38
  • @MarkSetchell Thanks for the info, I'll look into that. – Drew Malone Apr 01 '16 at 16:39
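
As a quick sketch of the parameter-substitution approach suggested in the comments (the filename here is just a made-up example):

file="IMG_0001.jpg"

# awk approach: starts one external process per file
ext=$(awk -F. '{print $NF}' <<<"$file")   # -> jpg

# parameter-expansion approach: pure bash, no extra process
ext=${file##*.}                           # -> jpg, everything after the last dot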

Answer:

The bottleneck of any bash script is usually the number of external processes you start. In this case, you can vastly reduce the number of calls to mv you make by recognizing that a large percentage of the files you want to move will have a common suffix like jpg, etc. Start with those.

for ext in jpg mp3; do
    mkdir -p "$ext"
    # For simplicity, I'll assume your mv command supports the -t option
    find "$2" -maxdepth 1 -name "*.$ext" -exec mv -t "$ext" {} +
done

Using `-exec mv -t "$ext" {} +` means `find` will pass as many files as possible to each call to `mv`. For each extension, this means one call to `find` and a minimum number of calls to `mv`.
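
If your `mv` does not support the `-t` option (the loop above assumes it does), a hedged alternative is to let a small inline shell put the destination last; the snippet below is only a sketch along those lines, and the `sh` wrapper and `dest` variable are purely illustrative:

for ext in jpg mp3; do
    mkdir -p "$ext"
    # The inline sh receives the destination directory as $1, shifts it away,
    # and moves the remaining batched file arguments into it.
    find "$2" -maxdepth 1 -name "*.$ext" \
        -exec sh -c 'dest=$1; shift; mv -- "$@" "$dest"' sh "$ext" {} +
done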

Once those files are moved, then you can start analyzing files one at a time.

for f in "$2"/*; do
    ext=${f##*.}
    # Probably more efficient to check in-shell if the directory
    # already exists than to start a new process to make the check
    # for you.
    [[ -d $ext ]] || mkdir "$ext"
    mv "$f" "$ext"
done

The trade-off lies in deciding how much work you want to do up front identifying the common extensions, in order to minimize the number of iterations of the second `for` loop.

chepner
  • +1 for efficiency. Taking into account that I don't need some files, I could adapt this to remove all DLL files, clearing out a block of files I don't need before we deal with them one by one. – Drew Malone Apr 01 '16 at 16:49
  • With a single directory of files you don't even need `find` here at all. Just `mv -t tgt *.glob` will do (until the list of files gets too big, at which point `xargs` can be useful). – Etan Reisner Apr 01 '16 at 18:05
  • I kept `find` for the ability to dynamically decide how many files can be passed to each call of `mv` using `-exec ... +`. If I'm not mistaken, `xargs` is limited to specifying a fixed maximum number of arguments, regardless of the cumulative length of the arguments. – chepner Apr 01 '16 at 18:29
  • @EtanReisner the primary reason for making the script to handle it in the first place was `bash: /bin/mv: Argument list too long` – Drew Malone Apr 01 '16 at 18:48
  • @DrewMalone Right. Like I said, you are fine until you hit that, and then you can use `xargs` instead of `find`, though both of them (used properly) are fine (and actually it is slightly easier to use `find` safely/correctly by itself, I think). – Etan Reisner Apr 04 '16 at 13:23
  • I think I'm going to have to tweak the second loop provided; for some odd reason it moved all the folders into the root of my hdd. Not a good thing, as I normally work off a USB or on a CIFS share and don't need (nor want) all the extra files in my root drive. All the files were disappearing from my working directory, so I set up an echo and this is what it gave me for the move: `Moving file .//xlsx into //xlsx`. I'll edit the loop and post back when I come up with a solution; in the meantime, anyone using this be careful, as this is a bug that could cause a lot of damage. – Drew Malone Apr 05 '16 at 16:47
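
A minimal sketch of the kind of adjustment hinted at in that last comment, assuming the goal is to keep the extension folders inside the source directory and to skip anything that is not a regular file with an extension:

for f in "$2"/*; do
    [[ -f $f ]] || continue            # skip the extension directories themselves
    name=${f##*/}                      # work on the basename so the dot in "./" is never mistaken for an extension
    [[ $name == *.* ]] || continue     # skip files that have no extension at all
    ext=${name##*.}
    [[ -d "$2/$ext" ]] || mkdir "$2/$ext"
    mv "$f" "$2/$ext"
done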